Unit II
BIG DATA
ANALYTICS
Subject Code: CS8091
Regulation : R 2017
1
2
Text Books and References
3
◻ Text Books:
1. Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive Datasets”, Cambridge University Press, 2012.
2. David Loshin, “Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph”, Morgan Kaufmann/Elsevier Publishers, 2013.
◻ References:
1. EMC Education Services, “Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data”, Wiley Publishers, 2015.
2. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science and its Applications”, Wiley Publishers, 2015.
3. Dietmar Jannach and Markus Zanker, “Recommender Systems: An Introduction”, Cambridge University Press, 2010.
4. Kim H. Pries and Robert Dunnigan, “Big Data Analytics: A Practical Guide for Managers”, CRC Press, 2015.
5. Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with MapReduce”, Synthesis Lectures on Human Language Technologies, Vol. 3, No. 1, Pages 1-177, Morgan & Claypool Publishers, 2010.
Machine learning
◻ Machine learning is an application of artificial
intelligence (AI) that gives systems the ability to
automatically learn and improve from experience
without being explicitly programmed.
4
Machine learning techniques
5
Contd…
6
Decision trees
K - means
7
Unit II
Clustering and Classification
Classification and clustering are two methods of
pattern identification used in machine learning.
8
9
Definition: Clustering
◻ Clustering is the process of grouping data points so that points in the same cluster are more similar to each other than to points in other clusters.
10
11
Example
◻ Assume two clusters 🡪 Mammal & Reptile.
🞑 Mammal cluster includes humans, leopards, elephants, etc.
🞑 Reptile cluster includes snakes, lizards, komodo dragons, etc.
◻ Some algorithms used for clustering are the k-means clustering
algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM)
clustering algorithm, etc.
12
Example
◻The data points in the graph are
clustered together into 3 clusters.
13
Contd…
◻ It is not necessary for clusters to be spherical.
14
What are the Uses of
Clustering?
◻Clustering has a variety of uses in a
variety of industries.
🞑Market segmentation
🞑Social network analysis
🞑Search result grouping
🞑Medical imaging
🞑Image segmentation
🞑Anomaly detection
15
Contd…
◻After clustering, each cluster is assigned
a number called cluster ID.
16
Types of clustering
◻Hard clustering: Grouping the data items
such that each item is in only one cluster.
◻Soft (or) overlapping Clustering :
Grouping the data items such that data
items can exist in multiple clusters.
17
Cluster Formation Methods
(or) Clustering Methods
◻Density-based Clustering
◻Distribution-based Clustering
◻Partitioning Methods
◻Hierarchical Clustering
18
Density-based Clustering
◻ Points lying in similar dense areas are grouped into clusters.
◻ These methods have good accuracy and the ability to merge two
clusters.
🞑Example
■DBSCAN (Density-Based Spatial Clustering of
Applications with Noise)
■OPTICS (Ordering Points to Identify Clustering
Structure) etc.
19
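As a minimal sketch (not part of the original slides), density-based clustering can be run in R with the dbscan package; the iris measurements and the eps/minPts values below are illustrative assumptions only:
# Assumes the dbscan package is installed: install.packages("dbscan")
library(dbscan)
x <- as.matrix(iris[, 1:4])              # illustrative built-in numeric data
db <- dbscan(x, eps = 0.5, minPts = 5)   # eps and minPts chosen only for illustration
db$cluster                               # cluster label for each point; 0 means noise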
Contd…
• These clusters can take arbitrary shapes.
20
Partitioning Methods
◻Partitions the objects into k groups, where
each partition forms one cluster.
🞑Example:
■K-means & CLARANS (Clustering Large
Applications based upon Randomized Search)
21
K means clustering
It follows centroid-based clustering.
◻Organizes the data into non-hierarchical
clusters
🞑Example: k-means clustering algorithm.
22
Distribution-based Clustering
◻This clustering approach assumes data is
composed of distributions, such as
Gaussian distributions.
23
Hierarchical Clustering
◻Creates a tree of clusters, i.e., constructs
clusters as a tree-type structure based on
their hierarchy.
◻Two categories
🞑Agglomerative (Bottom-up approach)
🞑Divisive (Top-down approach)
◻Example:
🞑Clustering using Representatives (CURE),
🞑Balanced iterative Reducing Clustering using
Hierarchies (BIRCH), etc.
24
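A minimal sketch of agglomerative (bottom-up) hierarchical clustering with base R, using the built-in iris measurements purely as illustrative data:
x  <- iris[, 1:4]                            # illustrative built-in data
hc <- hclust(dist(x), method = "complete")   # build the tree of clusters (dendrogram)
plot(hc)                                     # inspect the hierarchy
cutree(hc, k = 3)                            # cut the tree into 3 clusters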
25
Working of Hierarchical clustering
26
Grid-Based method
◻The data space is formulated into a finite
number of cells that form a grid-like
structure
🞑Example
■Statistical Information Grid (STING)
■Clustering in Quest (CLIQUE).
27
K-means clustering algorithm
◻K-Means Clustering is an unsupervised learning
algorithm that is used to solve the clustering
problems in machine learning or data science.
28
What is K-Means Algorithm?
◻K-Means Clustering is an Unsupervised
Learning algorithm that groups an unlabeled
dataset into different clusters.
🞑K 🡪 number of pre-defined clusters
🞑If K=2 🡪 two clusters
🞑If K=3 🡪 three clusters
Definition:
It is an iterative algorithm that divides the unlabeled
dataset into k different clusters in such a way that each
data point belongs to only one group of points with similar
properties.
29
K-Means Clustering Algorithm
K-Means Algorithm involves following steps:
◻ Step-01:
🞑 Choose the number of clusters K.
◻ Step-02:
🞑 Randomly select any K data points as cluster
centers (centroids).
🞑 Select the cluster centers so that they are as far apart
from each other as possible.
◻ Step-03:
🞑 Calculate the distance between each data point and each
cluster center.
🞑 The distance may be calculated either by using a given
distance function or by using the Euclidean distance formula.
◻ Step-04:
🞑 Assign each data point to some cluster.
🞑 A data point is assigned to that cluster whose center is
nearest to that data point.
30
Contd…
◻ Step-05:
🞑Re-compute the center of newly formed
clusters.
🞑The center of a cluster is computed by taking
mean of all the data points contained in that
cluster.
◻ Step-06:
🞑Keep repeating the procedure from Steps 3
to 5 until any of the following stopping
criteria is met:
■Center of newly formed clusters do not change
■Data points remain present in the same cluster
■Maximum number of iterations are reached
31
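These steps are exactly what base R's kmeans() carries out; a minimal sketch, using the built-in iris measurements purely as illustrative data:
x  <- iris[, 1:4]                        # illustrative built-in data
km <- kmeans(x, centers = 2, iter.max = 100, nstart = 10)   # K = 2
km$cluster                               # cluster assigned to each data point
km$centers                               # final cluster centers (centroids)
km$tot.withinss                          # total within-cluster sum of squares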
Working of K means
algorithm
The given data points are
pictured as shown.
Step 1:
Let's take the number k of clusters, i.e.,
K=2. Choose k random points
or centroids to form the clusters.
Step 2:
32
Contd…
By applying the distance
formula, calculate the distance
between each point and the two centroids,
and draw the median (boundary) between
both centroids.
Step 3:
Points on the left side are near the K1
(blue) centroid and points on the right side
are close to the
yellow centroid.
Step 4:
33
Contd…
Repeat the process by choosing
new centroids, computed as the mean of
each cluster, and draw the median line again.
Step 5:
Reassign each data point to
the nearest new centroid.
Step 6:
34
Contd…
We can see that one yellow point is on
the left side of the line, and two blue
points are on the right of the line. These three
points are re-assigned to the new centroids.
Step 7:
Repeat the process by
finding the new centroids
Step 8:
35
Contd…
No dissimilar data points remain on either
side of the line.
Step 9:
The final two clusters are shown.
Step 10:
36
Flowchart – k means
37
Merits & Demerits of K-means
Advantages-
◻ K-Means Clustering Algorithm offers the following advantages-
◻ Point-01:
🞑 It is relatively efficient with time complexity O(nkt) where
🞑 n = number of instances
🞑 k = number of clusters
🞑 t = number of iterations
◻ Point-02:
🞑 It often terminates at local optimum.
🞑 Techniques such as Simulated Annealing or Genetic Algorithms
may be used to find the global optimum.
Disadvantages-
◻ K-Means Clustering Algorithm has the following disadvantages-
◻ It requires the number of clusters (k) to be specified in advance.
◻ It cannot handle noisy data and outliers.
◻ It is not suitable to identify clusters with non-convex shapes.
38
Determining the number of
clusters:
39
◻Elbow method
◻Average silhouette method
◻Gap statistic method
Elbow method
40
◻The Elbow method
🞑Compute the total WSS (within-cluster sum of squares).
🞑Plot WSS against the number of clusters and look for the bend.
Elbow method - Steps
41
◻Step 1
🞑Compute clustering algorithm (e.g., k-means
clustering) for different values of k.
■ For instance, by varying k from 1 to 10 clusters.
◻Step 2
🞑For each k, calculate the total within-cluster sum
of square (wss).
◻Step 3
🞑Plot the curve of wss according to the number of
clusters k.
◻Step 4
🞑The location of a bend (knee) in the plot is an
indicator of the appropriate number of clusters.
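A minimal sketch of these four steps in R (illustrative built-in data; k varied from 1 to 10 as suggested above):
x   <- iris[, 1:4]                       # illustrative built-in data
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
# the bend (knee) in this curve indicates the appropriate number of clusters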
Average silhouette method
42
◻Compute the average silhouette of
observations for different values of k.
🞑The optimal number of clusters k is the one
that maximizes the average silhouette.
Average silhouette - Steps
43
◻Step 1
🞑Compute clustering algorithm (e.g., k-means
clustering) for different values of k.
■For instance, by varying k from 1 to 10 clusters.
◻Step 2
🞑For each k, calculate the average silhouette of
observations (avg.sil).
◻Step 3
🞑Plot the curve of avg.sil according to the
number of clusters k.
◻Step 4
🞑The location of the maximum is considered as
the appropriate number of clusters.
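A minimal sketch of the same steps using silhouette() from the cluster package (a silhouette needs at least 2 clusters, so k runs from 2 to 10 here; the data is illustrative):
library(cluster)                         # provides silhouette()
x <- iris[, 1:4]                         # illustrative built-in data
d <- dist(x)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, 3])   # column 3 holds the silhouette widths
})
plot(2:10, avg_sil, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette")
# the k at the maximum of this curve is taken as the number of clusters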
Gap statistic method
44
◻Compares the total intra-cluster
variation for different values of k with
its expected value under a null
reference distribution of the data.
◻The optimal k is the one that maximizes the
gap statistic (i.e., that yields the largest
gap statistic).
Gap statistic - steps
45
◻ Step 1
🞑 Cluster the data, varying the k = 1, …, kmax, and compute the
corresponding total within intra-cluster variation Wk.
◻ Step 2
🞑 Generate reference data sets with a random uniform distribution.
Cluster each of these reference data sets with varying number of
clusters k = 1, …, kmax, and compute the corresponding total
within intra-cluster variation Wkb.
◻ Step 3
🞑 Compute the estimated gap statistic as the deviation of the
observed Wk value from its expected value W*kb under the null
hypothesis:
Gap(k) = (1/B) Σ_{b=1..B} log(W*kb) − log(Wk)
Also compute the standard deviation of the statistics.
◻ Step 4
🞑 Choose the number of clusters as the smallest value of k such
that the gap statistic is within one standard deviation of the gap
at k+1: Gap(k) ≥ Gap(k+1) − s_{k+1}.
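These steps are implemented by clusGap() in the cluster package; a minimal sketch with illustrative data and B = 50 reference sets:
library(cluster)                         # provides clusGap()
x   <- iris[, 1:4]                       # illustrative built-in data
gap <- clusGap(x, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 10)
gap                                      # gap statistic and standard error for each k
plot(gap)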
Comparison – Different clusters
46
Where can I apply k-means?
◻Document Classification
◻Delivery Store Optimization
◻Identifying Crime Localities
◻Customer Segmentation
◻Fantasy League Stat Analysis
◻Insurance Fraud Detection
◻Rideshare Data Analysis
◻Cyber-Profiling Criminals
47
Use case 1
◻Categorizing documents based on tags,
topics, and the content of the document is
very difficult.
◻The k-means algorithm is well suited
for this purpose. Based on term
frequency, the document vectors are
clustered to help identify similarity in
document groups.
Document Classification
48
Use case 2
◻Clustering delivery locations helps find an
optimized truck route, a variant of the
traveling salesman problem.
Delivery Store Optimization
49
Use case 3
◻Using the category of crime, the area of
the crime, and their association helps
identify crime-prone areas within a city.
Identifying Crime Localities
50
Use case 4
◻Clustering helps marketers to segment
customers based on purchase history,
interests, or activity monitoring.
🞑For example, telecom providers can cluster
pre-paid customers by identifying the money
spent in recharging, sending SMS, and
browsing the internet.
🞑It helps the company target specific clusters
of customers for specific campaigns
Customer Segmentation
51
Use case 5
◻Analyzing player stats has always been a
critical element of the sporting world, with
increasing competition.
◻Machine learning plays a major role.
Fantasy League Stat Analysis
52
Use case 6
◻Machine learning plays a critical role in fraud
detection and has numerous applications
in automobile, healthcare, and insurance
fraud detection.
🞑Based on past historical data on fraudulent
claims, fraudulent patterns can be easily
identified.
Insurance Fraud Detection
53
Use case 7
◻Ride datasets with information about
traffic, transit times, and peak pickup localities
help ride-sharing companies such as Uber
plan for future demand in cities.
Rideshare Data Analysis
54
Use case 8
◻Cyber-profiling 🡪 collects data from
individuals and groups to identify
significant correlations.
◻This information helps the investigation
division classify criminals based on
their types.
Cyber-Profiling Criminals
55
Use case 9
◻Call detail record (CDR) is the information
captured by telecom companies during the
call, SMS, and internet activity of a
customer.
🞑This information helps telecom companies understand
customers’ needs and their usage details.
Call Record Detail Analysis
56
Use case 10
◻Large IT infrastructure technology
components such as network, storage, or
database generate large volumes of alert
messages.
🞑These alert messages potentially point to
operational issues that need processing.
🞑Clustering 🡪 helps to categorize the alerts.
Automatic Clustering of IT Alerts
57
Clustering and Classification
58
Differences between
Classification and Clustering
59
Contd…
60
Classification
◻Classification is the process of learning a
model that categorizes different classes of
data.
◻It’s a two-step process:
◻Learning step
🞑The learning step can be accomplished by
using an already defined training set of data.
◻Prediction step
🞑The learned model is used to predict or classify
new data.
61
Algorithms used for classification
◻Logistic Regression
◻Decision Tree
◻Naive Bayes classifier
◻Support Vector Machines(SVM)
◻Random Forest
62
Decision tree
Decision Tree is a Supervised learning technique
that can be used for both classification and
Regression problems.
63
Definition : Decision tree
◻A decision tree is a tree where
🞑Each node represents a feature(attribute)
🞑Each branch represents a decision(rule)
🞑Each leaf represents an outcome
(categorical or continuous value).
64
Idea behind decision tree
◻ Decision Trees mimic human-like thinking while
making a decision, which makes them easy to understand.
◻ They create a tree-like structure, hence they are easily
interpreted.
◻ Decision trees can handle both categorical and
numerical data.
65
Terminology related to
Decision Trees
◻Root Node:
🞑It represents the entire population or sample or
datasets.
◻Splitting:
🞑It is a process of dividing a node into two or more
sub-nodes.
◻Decision Node:
🞑When a sub-node splits into further sub-nodes,
then it is called the decision node.
◻Leaf / Terminal Node:
🞑Nodes that do not split further are called leaf or terminal nodes.
66
Terminology contd…
◻Pruning:
🞑The process of removing unwanted
branches from the tree; it is the opposite of
splitting.
◻Parent/Child node:
🞑The root node of the tree is called the parent
node, and other nodes are called the child
nodes.
67
Contd…
68
Decision tree learning Algorithms
69
◻ID3 (Iterative Dichotomiser 3)
🞑ID3 uses Entropy and information Gain to
construct a decision tree
◻C4.5 (successor of ID3)
◻CART (Classification And Regression Tree)
◻CHAID (CHi-squared Automatic Interaction
Detector)
🞑Performs multi-level splits when computing
classification trees.
◻MARS: extends decision trees to handle
numerical data better.
Working of Decision tree algorithm
◻ Step-1:
🞑 Begin the tree with the root node, say S, which contains the
complete dataset.
◻ Step-2:
🞑 Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
◻ Step-3:
🞑 Divide S into subsets that contain the possible values of the
best attribute.
◻ Step-4:
🞑 Generate the decision tree node that contains the best
attribute.
◻ Step-5:
🞑 Recursively make new decision trees using the subsets of the
dataset created in Step-3.
🞑 Continue this process until a stage is reached where the nodes
cannot be classified further; the final nodes are called
leaf nodes.
70
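These steps are what tree-growing libraries carry out internally. As a minimal sketch (the slides themselves later use the party package; rpart is shown here only as an assumed alternative that builds CART-style trees, with the built-in iris data as an illustration):
library(rpart)                               # assumed to be installed
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                                   # text view of the chosen splits
predict(fit, iris[1:3, ], type = "class")    # classify a few rows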
Contd…
71
ID3
72
◻Core algorithm for building decision
trees is called ID3.
🞑Developed by J. R. Quinlan,
🞑Employs a top-down, greedy search
🞑Uses Entropy and Information Gain to
construct a decision tree.
Attribute Selection Measures
◻Main issue is how to select the best
attribute for the root node and for sub-
nodes.
◻This is done by a technique called
Attribute Selection Measure (ASM).
🞑Information Gain
🞑Gini Index
🞑Gain Ratio
73
Information Gain
◻To understand information gain we should
know Entropy.
◻Define Entropy:
🞑Randomness or uncertainty of a random variable
X is defined by Entropy.
◻For a binary classification problem there are only
two classes: a positive and a negative class.
🞑If all examples are positive or all are
negative 🡪 Entropy is zero, i.e., low.
🞑If half of the records are of the positive class and half
are of the negative class 🡪 Entropy is one, i.e., high.
74
Entropy value
75
Contd…
◻ By calculating entropy measure of each attribute
we can calculate their information gain.
◻ Information Gain calculates the expected reduction
in entropy due to sorting on the attribute.
76
Training Example
77
Example:
Construct a Decision Tree by using “information gain” as a criterion
78
Given dataset:
Which attribute to select?
79
80
Criterion for attribute selection
81
◻Which is the best attribute?
🞑The one which will result in the smallest
tree
🞑choose the attribute that produces the
“purest” nodes
■Purity is measured by information gain
■Information gain increases with the average purity of the subsets
Steps to calculate information
gain:
◻Step 1:
🞑Calculate the entropy of the target.
◻Step 2:
🞑Calculate the entropy of every attribute.
🞑Calculate the Information Gain using the
following formula.
Information Gain(T, X) = Entropy(T) – Entropy(T, X)
82
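A minimal sketch of this calculation in plain R, using the Outlook counts of the golf dataset shown later (slide 104): the target has 9 Yes / 5 No, and Outlook splits into Sunny (2, 3), Overcast (4, 0) and Rainy (3, 2):
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                          # avoid log2(0)
  -sum(p * log2(p))
}
target_entropy <- entropy(c(9, 5))       # entropy of the target, about 0.94
outlook <- list(c(2, 3), c(4, 0), c(3, 2))
split_entropy <- sum(sapply(outlook, function(cnt) sum(cnt) / 14 * entropy(cnt)))
info_gain <- target_entropy - split_entropy   # about 0.247, matching slide 91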
Contd…
◻To build a decision tree, calculate two
types of entropy using frequency tables
as follows:
🞑Entropy using the frequency table of one
attribute
🞑Entropy using the frequency table of two
attributes
83
Entropy using the frequency
table of one attribute
Note: use the Golf dataset (slide 77; the same data appears on slide 104): 9 Yes, 5 No.
9 + 5 = 14
5/14 = 0.36 and 9/14 = 0.64
Entropy(PlayGolf) = –0.64·log2(0.64) – 0.36·log2(0.36) ≈ 0.94
84
Entropy using the frequency
table of two attributes
85
◻Formula for calculating the entropy in the
splitting process: E(T, X) = Σc P(c)·E(c), i.e., the weighted sum of the entropies of the subsets produced by splitting on attribute X.
Entropy using two attributes – contd…
86
Contd…
87
Contd…
88
Contd…
89
90
Choose attribute with the largest information gain.
Divide the dataset by its branches and repeat the process on every branch.
91
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
Select the attribute with the highest information gain.
Information gain tells us how important the given
attribute is.
Constructing a decision tree is all about finding the
attribute that returns the highest information gain.
Selecting the best attribute
92
Continuing to split
93
Split on sunny
94
95
A branch with entropy of 0 is a
leaf node.
96
A branch with entropy more
than 0 needs further splitting.
97
Split on Rainy
98
Final Decision Tree
99
◻The ID3 algorithm is run recursively on the
non-leaf branches, until all data is classified.
Decision Tree to Decision Rules
100
CART (Classification and
Regression Tree)
101
◻Another decision tree algorithm CART
uses the Gini method to create split
points using Gini Index (Gini Impurity)
and Gini Gain.
◻Define Gini Index:
🞑The Gini index or Gini impurity measures the
probability that a randomly
chosen element is wrongly classified.
Purity and impurity
102
◻Pure
🞑All the elements belong to a single class.
◻Gini index varies between 0 and 1
🞑0 🡪 all elements belong to the same class
🞑1 🡪 the elements are randomly distributed
across various classes
🞑0.5 🡪 the elements are equally distributed
across classes
Equation of Gini Index
103
◻Gini index is a metric for classification in
CART.
◻It is computed from the sum of the squared probabilities of
each class.
🞑Formula used is: Gini = 1 − Σi (pi)²
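A minimal sketch of this formula in R, checked against the Outlook numbers worked out on the following slides:
gini <- function(counts) 1 - sum((counts / sum(counts))^2)
gini(c(2, 3))                            # Outlook = Sunny (2 Yes, 3 No) -> 0.48
gini(c(4, 0))                            # Outlook = Overcast -> 0
# weighted Gini of the Outlook split -> about 0.342
(5/14) * gini(c(2, 3)) + (4/14) * gini(c(4, 0)) + (5/14) * gini(c(3, 2))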
Taking the same Example:
Construct a Decision Tree by using “Gini Index” as a criterion
104
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Outlook
105
            Play Golf
            Yes   No   Total
Sunny        2     3     5
Overcast     4     0     4
Rainy        3     2     5
Total                   14
Outlook
106
Gini(Outlook=Sunny)    = 1 – (2/5)² – (3/5)² = 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)² – (0/4)² = 0
Gini(Outlook=Rain)     = 1 – (3/5)² – (2/5)² = 1 – 0.36 – 0.16 = 0.48
Calculate the weighted sum of Gini indexes (Outlook):
Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48
              = 0.171 + 0 + 0.171 = 0.342
Temperature
107
            Play Golf
            Yes   No   Total
Hot          2     2     4
Cool         3     1     4
Mild         4     2     6
Total                   14
Temperature
108
Gini(Temp=Hot)  = 1 – (2/4)² – (2/4)² = 0.5
Gini(Temp=Cool) = 1 – (3/4)² – (1/4)² = 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)² – (2/6)² = 1 – 0.444 – 0.111 = 0.445
Calculate the weighted sum of Gini indexes (Temperature):
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445
           = 0.142 + 0.107 + 0.190 = 0.439
Humidity
109
            Play Golf
            Yes   No   Total
High         3     4     7
Normal       6     1     7
Total                   14
Gini(Humidity=High)   = 1 – (3/7)² – (4/7)² = 1 – 0.183 – 0.326 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)² – (1/7)² = 1 – 0.734 – 0.02 = 0.244
Calculate the weighted sum of Gini indexes (Humidity):
Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367
Wind
110
            Play Golf
            Yes   No   Total
Weak         6     2     8
Strong       3     3     6
Total                   14
Gini(Wind=Weak)   = 1 – (6/8)² – (2/8)² = 1 – 0.5625 – 0.062 = 0.375
Gini(Wind=Strong) = 1 – (3/6)² – (3/6)² = 1 – 0.25 – 0.25 = 0.5
Calculate the weighted sum of Gini indexes (Wind):
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428
Selecting the attribute
111
Feature       Gini index
Outlook       0.342   <- minimum value
Temperature   0.439
Humidity      0.367
Wind          0.428
Therefore Outlook is put at the top of the tree.
Outlook is put at the top
112
The sub-dataset for Overcast contains only “Yes” decisions.
This means the Overcast branch ends in a leaf.
113
Continuing to split
114
◻Focus on the sub dataset for sunny
outlook.
◻Find the gini index scores for
temperature, humidity and wind features
respectively.
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes
Gini of temperature for
sunny outlook
115
Gini(Outlook=Sunny and Temp.=Hot)  = 1 – (0/2)² – (2/2)² = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)² – (0/1)² = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)² – (1/2)² = 1 – 0.25 – 0.25 = 0.5
Calculate the weighted sum of Gini indexes (Sunny outlook & Temperature):
Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5
                              = 0.2
            Play Golf
            Yes   No   Total
Hot          0     2     2
Cool         1     0     1
Mild         1     1     2
Total                    5
Gini of humidity for sunny
outlook
116
Gini(Outlook=Sunny and Humidity=High)   = 1 – (0/3)² – (3/3)² = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)² – (0/2)² = 0
Calculate the weighted sum of Gini indexes (Sunny outlook & Humidity):
Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0
            Play Golf
            Yes   No   Total
High         0     3     3
Normal       2     0     2
Total                    5
Gini of wind for sunny outlook
117
Gini(Outlook=Sunny and Wind=Weak)   = 1 – (1/3)² – (2/3)² = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)² – (1/2)² = 0.5
Calculate the weighted sum of Gini indexes (Sunny outlook & Wind):
Gini(Outlook=Sunny and Wind) = (3/5)x0.444 + (2/5)x0.5
                             = 0.266 + 0.2 = 0.466
            Play Golf
            Yes   No   Total
Weak         1     2     3
Strong       1     1     2
Total                    5
Decision for sunny outlook
118
Feature       Gini index
Temperature   0.2
Humidity      0       <- minimum value
Wind          0.466
Proceed with Humidity at the extension of the Sunny outlook branch.
Decision for sunny outlook
119
The decision is always NO for high humidity and sunny outlook.
The decision is always YES for normal humidity and sunny outlook.
This branch is now complete.
Decisions for high and normal
humidity
120
Taking the outlook = rain
121
◻ Calculate gini index scores for temperature,
humidity and wind features when outlook is rain.
Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
10 Rain Mild Normal Weak Yes
14 Rain Mild High Strong No
Gini of temperature for rain
outlook
122
            Play Golf
            Yes   No   Total
Cool         1     1     2
Mild         2     1     3
Total                    5
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)² – (1/3)² = 0.444
Calculate the weighted sum of Gini indexes (Rain outlook & Temperature):
Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444
                             = 0.466
Gini of humidity for rain
outlook
123
            Play Golf
            Yes   No   Total
High         1     1     2
Normal       2     1     3
Total                    5
Gini(Outlook=Rain and Humidity=High)   = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)² – (1/3)² = 0.444
Calculate the weighted sum of Gini indexes (Rain outlook & Humidity):
Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444
                                = 0.466
Gini of wind for rain outlook
124
            Play Golf
            Yes   No   Total
Weak         3     0     3
Strong       0     2     2
Total                    5
Gini(Outlook=Rain and Wind=Weak)   = 1 – (3/3)² – (0/3)² = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)² – (2/2)² = 0
Calculate the weighted sum of Gini indexes (Rain outlook & Wind):
Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0
Decision for rain outlook
125
Feature       Gini index
Temperature   0.466
Humidity      0.466
Wind          0       <- minimum value
Put the Wind feature on the rain outlook branch and examine
the new sub-datasets.
Rain outlook – contd…
126
The decision is always YES when the wind is weak.
The decision is always NO when the wind is strong.
This branch is now complete.
Final decision tree built by
CART algorithm
127
Differences b/w CART & ID3
128
Evaluating the Decision tree
129
◻Performance Metrics for Classification
Problems
◻Performance Metrics for Regression
Problems
Performance Metrics for
Classification Problems
130
◻Confusion Matrix
◻Classification Accuracy
◻Classification Report
🞑Precision
🞑Recall
🞑F1 score
🞑Specificity
◻Logarithmic Loss
◻Area under Curve
Confusion Matrix
131
◻It is the easiest way to measure the
performance of a classification problem
where the output can be two or more
types of classes.
◻Define Confusion Matrix:
🞑A confusion matrix is nothing but a table with
two dimensions viz. “Actual” and “Predicted”
🞑The dimensions have “True Positives (TP)”,
“True Negatives (TN)”, “False Positives (FP)”,
“False Negatives (FN)” .
Parameters used in
Confusion matrix
132
True positives and true negatives 🡪 observations that are
correctly predicted (shown in green in the matrix).
False positives and false negatives 🡪 observations that are
wrongly predicted (shown in red); hence they should be
minimized.
Explanation for the
parameters
133
There are 4 important terms :
◻True Positives :
🞑The cases in which we predicted YES and the
actual output was also YES.
◻True Negatives :
🞑The cases in which we predicted NO and the
actual output was NO.
◻False Positives :
🞑The cases in which we predicted YES and the
actual output was NO.
◻False Negatives :
🞑The cases in which we predicted NO and the
actual output was YES.
Classification Accuracy
134
◻ Classification Accuracy (or) accuracy is the
ratio of number of correct predictions to the
total number of input samples.
Classification Report
135
◻The classification report consists of the
scores of:
🞑Precision
🞑Recall
🞑F1 score
🞑Specificity
Precision
Mainly used in document retrieval
136
◻Precision is defined as the fraction of documents returned
by the model that are actually correct: TP / (TP + FP).
Recall
137
Recall is defined as the fraction of actual positives
that the model returns: TP / (TP + FN).
Specificity
138
◻Specificity, in contrast to recall, is
defined as the fraction of actual negatives
correctly identified: TN / (TN + FP).
F1 Score
139
◻F1 score is the harmonic mean of
precision and recall:
F1 = 2 · (Precision · Recall) / (Precision + Recall)
🞑Best value of F1 🡪 1
🞑Worst value of F1 🡪 0
Formulas :
140
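The formulas behind this report, as a minimal R sketch; the TP/TN/FP/FN counts below are hypothetical values used only for illustration:
TP <- 40; TN <- 45; FP <- 5; FN <- 10    # hypothetical confusion-matrix counts
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)
specificity <- TN / (TN + FP)
f1          <- 2 * precision * recall / (precision + recall)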
AUC (Area Under ROC curve)
141
◻AUC (Area Under Curve) - ROC
(Receiver Operating Characteristic) is a
performance metric, based on varying
threshold values, for classification
problems.
🞑ROC is a probability curve
🞑AUC measures separability, i.e., how well the
model distinguishes between classes.
◻The higher the AUC, the better the model.
AUC - ROC Curve
142
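A minimal sketch, assuming the pROC package is available; the labels and scores are hypothetical values for illustration:
library(pROC)                            # assumed to be installed
labels <- c(0, 0, 1, 1, 1, 0, 1, 0)      # actual classes (hypothetical)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6)   # predicted probabilities
roc_obj <- roc(labels, scores)           # ROC curve over varying thresholds
auc(roc_obj)                             # area under the ROC curve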
LOGLOSS (Logarithmic Loss)
Logistic regression loss (or) cross-entropy loss
143
◻Accuracy simply counts correct predictions,
whereas Log Loss measures the amount of
uncertainty in the predictions.
◻Formula
🞑L(pi) = −log(pi)
🞑 pi is the probability the model assigns to the real
class of observation i.
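A minimal sketch of this formula in R; p holds hypothetical probabilities assigned to the true class of each observation:
p <- c(0.9, 0.8, 0.6, 0.95)              # hypothetical probabilities of the true class
log_loss <- -mean(log(p))                # average of L(p) = -log(p)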
Performance Metrics for
Regression Problems
144
◻The MSE, MAE, RMSE, and R-Squared
metrics are mainly used to evaluate the
performance in regression analysis.
🞑Mean Absolute Error (MAE)
🞑Mean Square Error (MSE)
🞑R-Squared (R²)
Formulas
145
MSE, MAE, RMSE, R2
146
◻R-squared (Coefficient of determination)
🞑It indicates how well the predicted values fit
compared to the original values.
■R² values range from 0 to 1
■Higher value 🡪 the model is better
MAE (Mean absolute error)
147
◻MAE is the average absolute difference between
the target values and the values predicted
by the model.
MSE (Mean Squared Error)
Most preferred metrics for regression tasks
148
◻It is simply the average of the squared
difference between the target value and
the value predicted by the regression
model.
RMSE (Root Mean Squared Error)
149
◻It is the square root of the MSE.
R-squared (Coefficient of
determination)
150
◻It represents how well the predicted
values fit compared to the original
values.
■R² values range from 0 to 1
■Higher value 🡪 the model is better
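A minimal sketch computing all four regression metrics in R on hypothetical actual/predicted values:
actual    <- c(3.0, 2.5, 4.0, 5.5)       # hypothetical target values
predicted <- c(2.8, 2.7, 4.3, 5.0)       # hypothetical model outputs
mae  <- mean(abs(actual - predicted))
mse  <- mean((actual - predicted)^2)
rmse <- sqrt(mse)
r2   <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)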
Decision trees in R
151
◻A decision tree is a graph that represents
choices and their results in the form of a tree.
🞑Nodes in the graph 🡪 represent an event or
choice
🞑Edges of the graph 🡪 represent the decision
rules or conditions
◻Decision trees are mostly used in Machine
Learning and Data Mining applications using
R.
🞑 The R package "party" is used to create decision trees.
Commands in R
152
◻R command to install the package
◻To create a decision tree
🞑formula is a formula describing the
predictor and response variables.
🞑data is the name of the data set used.
install.packages("party")
ctree(formula, data)
Commands in R – Contd…
153
◻Input Data
🞑Take the in-built data set named readingSkills to
create a decision tree.
🞑Take the variables "age", "shoeSize", and
"score" and check whether the given person is
a native speaker or not.
# Load the party package.
# It will automatically load other dependent packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))
Contd…
154
◻By executing the above code, it
produces the following result and chart.
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................
ctree() function to create the
decision tree and see its graph
155
# Load the party package. It will automatically load other
# dependent packages.
library(party)
# Create the input data frame.
input.dat <- readingSkills[c(1:105),]
# Give the chart file a name.
png(file = "decision_tree.png")
# Create the tree.
output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)
# Plot the tree.
plot(output.tree)
# Save the file.
dev.off()
By executing the above code,
it produces the following result
156
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
Loading required package: sandwich
Decision tree
157
Conclusion:
From the readingSkills tree, a person with a score less than 38.3 and age greater than 6 is not a
native speaker.
Naïve Bayes Algorithm
158
◻Naive Bayes is one of the powerful
machine learning algorithms that is used
for classification.
◻It is an extension of the Bayes theorem
wherein each feature assumes
INDEPENDENCE.
Bayes’ Theorem
159
◻Naive Bayes classifiers are a collection of
classification algorithms based on
Bayes’ Theorem.
Bayes’ Theorem:
It is a family of algorithms where all of them share
a common principle, i.e. every pair of features
being classified is independent of each other.
Bayes’ Algorithm – contd…
160
◻Bayes theorem provides a way of
calculating the posterior probability,
P(c|x), from P(c), P(x), and P(x|c).
◻Class conditional independence.
🞑Naive Bayes classifier assume that the
effect of the value of a predictor (x) on a
given class (c) is independent of the values
of other predictors.
Formula used in Bayes’ Theorem
161
P(c|x) = [ P(x|c) · P(c) ] / P(x)
◻ P(c|x) is the posterior probability of class (target) given predictor
(attribute).
◻ P(c) is the prior probability of class.
◻ P(x|c) is the likelihood which is the probability of predictor given
class.
◻ P(x) is the prior probability of predictor.
How Naive Bayes algorithm works?
Given dataset:
162
Steps:
163
◻Step 1
🞑The posterior probability is calculated
constructing a frequency table for each
attribute against the target.
◻Step 2
🞑Transform the frequency tables to likelihood
tables
◻Step 3
🞑Use the Naive Bayesian equation to calculate
the posterior probability for each class.
🞑Class with the highest posterior probability is
the outcome of prediction.
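A minimal sketch of Step 3 in plain R, using the golf dataset of slide 104 and a single predictor (Outlook = Sunny) for illustration:
p_yes <- 9 / 14                          # prior P(Yes)
p_no  <- 5 / 14                          # prior P(No)
p_sunny_given_yes <- 2 / 9               # likelihood P(Sunny | Yes)
p_sunny_given_no  <- 3 / 5               # likelihood P(Sunny | No)
post_yes <- p_sunny_given_yes * p_yes    # unnormalized posterior for Yes
post_no  <- p_sunny_given_no  * p_no     # unnormalized posterior for No
post_yes / (post_yes + post_no)          # about 0.4 -> the predicted class is No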
164
Step 1
Calculate posterior probability using frequency table
165
Step 2
The likelihood tables for all four predictors
166
Step 3
Here, 4 inputs and 1 target
Final posterior probabilities can be standardized between 0 and 1.
Merits of Naive Bayes algorithm
167
◻Naive Bayes algorithm merits:
🞑It is an easy and quick way to predict the class of
the dataset.
■Hence multi-class prediction is performed
easily.
🞑When the assumption of independence
holds, Naive Bayes performs better
than other algorithms such as logistic
regression.
🞑Only a small amount of training data is required.
Demerits of Naive Bayes
algorithm
168
◻Assumption: class conditional
independence, hence some loss of accuracy.
◻Practically, dependencies exist among
variables.
🞑Ex: hospital data may include the patient’s profile (age, family
history, etc.), symptoms (fever, cough, etc.) and diseases (lung
cancer, diabetes, etc.).
◻Dependencies among these cannot be
modeled by a Naive Bayes classifier.
More Related Content

PPTX
machine learning - Clustering in R
PPTX
Unsupervised learning Algorithms and Assumptions
PPT
Chapter 10 ClusBasic ppt file for clear understaning
PPT
Chapter -10-Clus_Basic.ppt -DataMinning
PPT
15857 cse422 unsupervised-learning
PPT
UniT_A_Clustering machine learning .ppt
PPTX
Unsupervised%20Learninffffg (2).pptx. application
PPTX
K-Means clustring @jax
machine learning - Clustering in R
Unsupervised learning Algorithms and Assumptions
Chapter 10 ClusBasic ppt file for clear understaning
Chapter -10-Clus_Basic.ppt -DataMinning
15857 cse422 unsupervised-learning
UniT_A_Clustering machine learning .ppt
Unsupervised%20Learninffffg (2).pptx. application
K-Means clustring @jax

Similar to big data analytics unit 2 notes for study (20)

PPTX
Unsupervised Learning: Clustering
PDF
Unsupervised learning clustering
PPTX
Presentation on K-Means Clustering
PPT
Clustering in Machine Learning: A Brief Overview.ppt
PDF
A Study of Efficiency Improvements Technique for K-Means Algorithm
PPTX
Unsupervised Learning.pptx
PPT
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
PPT
26-Clustering MTech-2017.ppt
PDF
[ML]-Unsupervised-learning_Unit2.ppt.pdf
DOCX
8.clustering algorithm.k means.em algorithm
PPTX
unitvclusteranalysis-221214135407-1956d6ef.pptx
PPTX
K MEANS CLUSTERING - UNSUPERVISED LEARNING
PDF
clustering using different methods in .pdf
PPT
Data Mining Lecture Node: Hierarchical Cluster in Data Mining
PPTX
K means clustring @jax
PPT
CS8091_BDA_Unit_II_Clustering
PPT
K_MeansK_MeansK_MeansK_MeansK_MeansK_MeansK_Means.ppt
PPTX
Cluster Analysis.pptx
Unsupervised Learning: Clustering
Unsupervised learning clustering
Presentation on K-Means Clustering
Clustering in Machine Learning: A Brief Overview.ppt
A Study of Efficiency Improvements Technique for K-Means Algorithm
Unsupervised Learning.pptx
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
26-Clustering MTech-2017.ppt
[ML]-Unsupervised-learning_Unit2.ppt.pdf
8.clustering algorithm.k means.em algorithm
unitvclusteranalysis-221214135407-1956d6ef.pptx
K MEANS CLUSTERING - UNSUPERVISED LEARNING
clustering using different methods in .pdf
Data Mining Lecture Node: Hierarchical Cluster in Data Mining
K means clustring @jax
CS8091_BDA_Unit_II_Clustering
K_MeansK_MeansK_MeansK_MeansK_MeansK_MeansK_Means.ppt
Cluster Analysis.pptx
Ad

Recently uploaded (20)

PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
Welding lecture in detail for understanding
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
web development for engineering and engineering
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
Digital Logic Computer Design lecture notes
PPTX
Sustainable Sites - Green Building Construction
PDF
composite construction of structures.pdf
PPTX
Geodesy 1.pptx...............................................
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
PPT on Performance Review to get promotions
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Welding lecture in detail for understanding
Embodied AI: Ushering in the Next Era of Intelligent Systems
web development for engineering and engineering
CH1 Production IntroductoryConcepts.pptx
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Digital Logic Computer Design lecture notes
Sustainable Sites - Green Building Construction
composite construction of structures.pdf
Geodesy 1.pptx...............................................
Foundation to blockchain - A guide to Blockchain Tech
Operating System & Kernel Study Guide-1 - converted.pdf
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPT on Performance Review to get promotions
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
bas. eng. economics group 4 presentation 1.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Ad

big data analytics unit 2 notes for study

  • 1. Unit II BIG DATA ANALYTICS Subject Code: CS8091 Regulation : R 2017 1
  • 2. 2
  • 3. Text Books and References 3 ◻ 1. Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive Datasets”, Cambridge University Press, 2012. 2. David Loshin, “Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph”, Morgan Kaufmann/El sevier Publishers, 2013. ◻ 1. EMC Education Services, “Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data”, Wiley publishers, 2015. 2. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science and its Applications”, Wiley Publishers, 2015. 3. Dietmar Jannach and Markus Zanker, “Recommender Systems: An Introduction”, Cambridge University Press, 2010. 4. Kim H. Pries and Robert Dunnigan, “Big Data Analytics: A Practical Guide for Managers ” CRC Press, 2015. 5. Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with MapReduce”, Synthesis Lectures on Human Language Technologies, Vol. 3, No. 1, Pages 1-177, Morgan Claypool publishers, 2010.
  • 4. Machine learning ◻ Machine learning is an application of artificial intelligence (AI) that provides systems to automatically learn and improve from experience without being explicitly programmed. 4
  • 8. Unit II Clustering and Classification Classification and clustering are two methods of pattern identification used in machine learning. 8
  • 9. 9
  • 11. 11
  • 12. Example ◻Assume two clusters Mammal & 🡪 Reptile. 🞑Mammal cluster includes human, 🡪 leopards, elephant, etc. 🞑Reptile cluster includes snakes, lizard, 🡪 komodo dragon etc. ◻Some algorithms used for clustering are k-means clustering algorithm, Fuzzy c- means clustering algorithm, Gaussian (EM) clustering algorithm etc. 12
  • 13. Example ◻The data points in the graph are clustered together into 3 clusters. 13
  • 14. Contd… ◻ It is not necessary for clusters to be a spherical. 14
  • 15. What are the Uses of Clustering? ◻Clustering has a variety of uses in a variety of industries. 🞑Market segmentation 🞑Social network analysis 🞑Search result grouping 🞑Medical imaging 🞑Image segmentation 🞑Anamoly detection 15
  • 16. Contd… ◻After clustering, each cluster is assigned a number called cluster ID. 16
  • 17. Types of clustering ◻Hard clustering: Grouping the data items such that each item is in only one cluster. ◻Soft (or) overlapping Clustering : Grouping the data items such that data items can exist in multiple clusters. 17
  • 18. Cluster Formation Methods (or) Clustering Methods ◻Density-based Clustering ◻Distribution-based Clustering ◻Partitioning Methods ◻Hierarchical Clustering 18
  • 19. Density-based Clustering ◻ Similar dense areas are clustered. ◻ Have good accuracy and ability to merge two clusters. 🞑Example ■DBSCAN (Density-Based Spatial Clustering of Applications with Noise) ■OPTICS (Ordering Points to Identify Clustering Structure) etc. 19
  • 20. Contd… • These clusters take any arbitrary shapes. 20
  • 21. Partitioning Methods ◻Partition the objects into k clusters and each partition forms one cluster. 🞑Example: ■K-means & CLARANS (Clustering Large Applications based upon Randomized Search) 21
  • 22. K means clustering It follow Centroid-based Clustering ◻Organizes the data into non-hierarchical clusters 🞑Example: k-means clustering algorithm. 22
  • 23. Distribution-based Clustering ◻This clustering approach assumes data is composed of distributions, such as Gaussian distributions. 23
  • 24. Hierarchical Clustering ◻Creates a tree of clusters. Construct clusters as a tree-type structure based on the hierarchy. ◻Two categories 🞑Agglomerative (Bottom-up approach) 🞑Divisive (Top-down approach) ◻Example: 🞑Clustering using Representatives (CURE), 🞑Balanced iterative Reducing Clustering using Hierarchies (BIRCH), etc. 24
  • 25. 25
  • 26. Working of Hierarchical clustering 26
  • 27. Grid-Based method ◻The data space is formulated into a finite number of cells that form a grid-like structure 🞑Example ■Statistical Information Grid (STING) ■Clustering in Quest (CLIQUE). 27
  • 28. K-means clustering algorithm ◻K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in machine learning or data science. 28
  • 29. What is K-Means Algorithm? ◻K-Means Clustering is an Unsupervised Learning algorithm, groups unlabeled 🡪 dataset into different clusters. 🞑K Number of pre-defined clusters 🡪 🞑If K=2 Two clusters 🡪 🞑If K=3 Three clusters 🡪 Definition: It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way each dataset belongs only one group having similar properties 29
  • 30. K-Means Clustering Algorithm K-Means Algorithm involves following steps: ◻ Step-01: 🞑 Choose the number of clusters K. ◻ Step-02: 🞑 Randomly select any K data points as cluster centers(centroids). 🞑 Select cluster centers in such a way that they are as farther as possible from each other. ◻ Step-03: 🞑 Calculate the distance between each data point and each cluster center. 🞑 The distance may be calculated either by using given distance function or by using euclidean distance formula. ◻ Step-04: 🞑 Assign each data point to some cluster. 🞑 A data point is assigned to that cluster whose center is nearest to that data point. 30
  • 31. Contd… ◻ Step-05: 🞑Re-compute the center of newly formed clusters. 🞑The center of a cluster is computed by taking mean of all the data points contained in that cluster. ◻ Step-06: 🞑Keep repeating the procedure from Steps 3 to 5 until any of the following stopping criteria is met: ■Center of newly formed clusters do not change ■Data points remain present in the same cluster ■Maximum number of iterations are reached 31
  • 32. Working of K means algorithm Given data points is pictured as shown. Step 1: Let's take number k of clusters, i.e., K=2. Choose some random k points or centroid to form the cluster. Step 2: 32
  • 33. Contd… By applying mathematical formulas, calculate the distance between two points. we draw the median between both the centroids. Step 3: Left side points is near to the K1 or blue centroid and right side points points are close to the yellow centroid Step 4: 33
  • 34. Contd… Repeat the process by choosing a new centroid. Its computed by calculating the median again. Step 5: Reassign each datapoint to the new centroid. Step 6: 34
  • 35. Contd… we can see, one yellow point is on the left side of the line, and two blue points are right to the line. Three points are assigned new centroids. Step 7: Repeat the process by finding the new centroids Step 8: 35
  • 36. Contd… No dissimilar data points on either side of the line Step 9: Final two clusters is shown Step 10: 36
  • 37. Flowchart – k means 37
  • 38. Merits & Demerits of K-means Advantages- ◻ K-Means Clustering Algorithm offers the following advantages- ◻ Point-01: 🞑 It is relatively efficient with time complexity O(nkt) where 🞑 n = number of instances 🞑 k = number of clusters 🞑 t = number of iterations ◻ Point-02: 🞑 It often terminates at local optimum. 🞑 Techniques such as Simulated Annealing or Genetic Algorithms may be used to find the global optimum. Disadvantages- ◻ K-Means Clustering Algorithm has the following disadvantages- ◻ It requires to specify the number of clusters (k) in advance. ◻ It can not handle noisy data and outliers. ◻ It is not suitable to identify clusters with non-convex shapes. 38
  • 39. Determining the number of clusters: 39 ◻Elbow method ◻Average silhouette method ◻Gap statistic method
  • 40. Elbow method 40 ◻The Elbow method 🞑Find Total WSS(within-cluster sum of square) 🞑WSS Number of clusters 🡪
  • 41. Elbow method - Steps 41 ◻Step 1 🞑Compute clustering algorithm (e.g., k-means clustering) for different values of k. ■ For instance, by varying k from 1 to 10 clusters. ◻Step 2 🞑For each k, calculate the total within-cluster sum of square (wss). ◻Step 3 🞑Plot the curve of wss according to the number of clusters k. ◻Step 4 🞑The location of a bend (knee) in the plot is an indicator of the appropriate number of clusters.
  • 42. Average silhouette method 42 ◻Compute the average silhouette of observations for different values of k. 🞑Optimal number of clusters k is the one that maximize the average silhouette
  • 43. Average silhouette - Steps 43 ◻Step 1 🞑Compute clustering algorithm (e.g., k-means clustering) for different values of k. ■For instance, by varying k from 1 to 10 clusters. ◻Step 2 🞑For each k, calculate the average silhouette of observations (avg.sil). ◻Step 3 🞑Plot the curve of avg.sil according to the number of clusters k. ◻Step 4 🞑The location of the maximum is considered as the appropriate number of clusters.
  • 44. Gap statistic method 44 ◻Compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data. ◻Optimal cluster k is which maximize the gap statistic (i.e, that yields the largest gap statistic).
  • 45. Gap statistic - steps 45 ◻ Step 1 🞑 Cluster the data, varying the k = 1, …, kmax, and compute the corresponding total within intra-cluster variation Wk. ◻ Step 2 🞑 Generate reference data sets with a random uniform distribution. Cluster each of these reference data sets with varying number of clusters k = 1, …, kmax, and compute the corresponding total within intra-cluster variation Wkb. ◻ Step 3 🞑 Compute the estimated gap statistic as the deviation of the observed Wk value from its expected value Wkb under the null hypothesis: Gap(k)=1B∑b=1Blog(W∗kb) −log(Wk)Gap(k)=1B∑b=1Blog(Wkb )−log(Wk) . Compute also ∗ the standard deviation of the statistics. ◻ Step 4 🞑 Choose the number of clusters as the smallest value of k such that the gap statistic is within one standard deviation of the gap at k+1: Gap(k)≥Gap(k + 1)−sk + 1.
  • 47. Where I can apply k means? ◻Document Classification ◻Delivery Store Optimization ◻Identifying Crime Localities ◻Customer Segmentation ◻Fantasy League Stat Analysis ◻Insurance Fraud Detection ◻Rideshare Data Analysis ◻Cyber-Profiling Criminals 47
  • 48. Use case 1 ◻Categorizing documents based on tags, topics, and the content of the document is very difficult. ◻k-means algorithm is very much suitable algorithm for this purpose. Based on term frequency, the document vectors are clustered to help identify similarity in document groups. Document Classification 48
  • 49. Use case 2 ◻Optimized path to solve the truck route which helps the traveling salesman problem. Delivery Store Optimization 49
  • 50. Use case 3 ◻Using the category of crime, the area of the crime and their association helps to identify crime-prone areas within a city Identifying Crime Localities 50
  • 51. Use case 4 ◻Clustering helps marketers to segment customers based on purchase history, interests, or activity monitoring. 🞑For example, telecom providers can cluster pre-paid customers by identifying the money spent in recharging, sending SMS, and browsing the internet. 🞑It helps the company target specific clusters of customers for specific campaigns Customer Segmentation 51
  • 52. Use case 5 ◻Analyzing player stats has always been a critical element of the sporting world, with increasing competition. ◻Machine learning plays a major role. Fantasy League Stat Analysis 52
  • 53. Use case 6 ◻Machine learning plays a critical role fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. 🞑Based on past historical data on fraudulent claims, fraudulent patterns can be easily identified. Insurance Fraud Detection 53
  • 54. Use case 7 ◻Using ride information dataset about traffic, transit time, peak pickup localities helps the call taxi drivers like uber helps to plan the cities for the future. Rideshare Data Analysis 54
  • 55. Use case 8 ◻Cyber-profiling collects data from 🡪 individuals and groups to identify significant co-relations. ◻These information on the investigation division to classify the criminals based on the their types. Cyber-Profiling Criminals 55
  • 56. Use case 9 ◻Call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. 🞑These information helps to know about the customer’s needs and their usage details. Call Record Detail Analysis 56
  • 57. Use case 10 ◻Large IT infrastructure technology components such as network, storage, or database generate large volumes of alert messages. 🞑These alert messages potentially point to operational issues for processing. 🞑Clustering helps to categorize the alerts 🡪 Automatic Clustering of IT Alerts 57
  • 61. Classification ◻Classification is the process of learning a model that categorizes different classes of data. ◻It’s a two-step process: ◻Learning step 🞑The learning step can be accomplished by using an already defined training set of data. ◻Prediction step 🞑Predict or classify based on the above response. 61
  • 62. Algorithms used for classification ◻Logistic Regression ◻Decision Tree ◻Naive Bayes classifier ◻Support Vector Machines(SVM) ◻Random Forest 62
  • 63. Decision tree Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems. 63
  • 64. Definition : Decision tree ◻A decision tree is a tree where 🞑Each node represents a feature(attribute) 🞑Each branch represents a decision(rule) 🞑Each leaf represents an outcome (categorical or continues value). 64
  • 65. Idea behind decision tree ◻ Decision Trees usually mimic human like thinking while making a decision, which is easy to understand. ◻ Create a tree like structure, hence its easily understood. ◻ Decision trees can handle both categorical and numerical data. 65
  • 66. Terminology related to Decision Trees ◻Root Node: 🞑It represents the entire population or sample or datasets. ◻Splitting: 🞑It is a process of dividing a node into two or more sub-nodes. ◻Decision Node: 🞑When a sub-node splits into further sub-nodes, then it is called the decision node. ◻Leaf / Terminal Node: 🞑Nodes do not split is called Leaf or Terminal node. 66
  • 67. Terminology contd… ◻Pruning: 🞑It’s the process of removing the unwanted branches from the tree. Its just opposite of splitting. ◻Parent/Child node: 🞑The root node of the tree is called the parent node, and other nodes are called the child nodes. 67
  • 69. Decision tree learning Algorithms 69 ◻ID3 (Iterative Dichotomiser 3) 🞑ID3 uses Entropy and information Gain to construct a decision tree ◻C4.5 (successor of ID3) ◻CART (Classification And Regression Tree) ◻CHAID (CHi-squared Automatic Interaction Detector). 🞑Performs multi-level splits when computing classification trees) ◻MARS: extends decision trees to handle numerical data better.
  • 70. Working of Decision tree algorithm ◻ Step-1: 🞑 Begin the tree with the root node, says S, which contains the complete dataset. ◻ Step-2: 🞑 Find the best attribute in the dataset using Attribute Selection Measure (ASM). ◻ Step-3: 🞑 Divide the S into subsets that contains possible values for the best attributes. ◻ Step-4: 🞑 Generate the decision tree node, which contains the best attribute. ◻ Step-5: 🞑 Recursively make new decision trees using the subsets of the dataset created in step -3. 🞑 Continue this process until a stage is reached where you cannot further classify the nodes and called the final node as a leaf node. 70
  • 72. ID3 72 ◻Core algorithm for building decision trees is called ID3. 🞑Developed by J. R. Quinlan, 🞑Employs a top-down, greedy search 🞑Uses Entropy and Information Gain to construct a decision tree.
  • 73. Attribute Selection Measures ◻Main issue is how to select the best attribute for the root node and for sub- nodes. ◻This is done by a technique called as Attribute selection measure or ASM. 🞑Information Gain 🞑Gini Index 🞑Gain Ratio 73
  • 74. Information Gain ◻To understand information gain we should know Entropy. ◻Define Entropy: 🞑Randomness or uncertainty of a random variable X is defined by Entropy. ◻For a binary classification problem with only two classes, positive and negative class. 🞑If all examples are all positive or all are negative Entropy will be 🡪 zero i.e, low. 🞑If half of the records are of positive class and half are of negative class Entropy is 🡪 one i.e, high. 74
  • 76. Contd… ◻ By calculating entropy measure of each attribute we can calculate their information gain. ◻ Information Gain calculates the expected reduction in entropy due to sorting on the attribute. 76
  • 78. Example: Construct a Decision Tree by using “information gain” as a criterion 78 Given dataset:
  • 79. Which attribute to select? 79
  • 80. 80
  • 81. Criterion for attribute selection 81 ◻Which is the best attribute? 🞑The one which will result in the smallest tree 🞑choose the attribute that produces the “purest” nodes ■Purity improves attribute selection ■Information gain with purity in subsets
  • 82. Steps to calculate information gain: ◻Step 1: 🞑Calculate entropy of Target. ◻Step 2: 🞑Entropy for every attribute needs to be calculated. 🞑Calculate Information Gain using the following formula. Information Gain = Target Entropy – (Entropy of all attributes) 82
  • 83. Contd… ◻To build a decision tree, calculate two types of entropy using frequency tables as follows: 🞑Entropy using the frequency table of one attribute 🞑Entropy using the frequency table of two attributes 83
  • 84. Entropy using the frequency table of one attribute Note : Use Golf dataset in slideno:77 9 + 5 = 14 5/14 = 0.36 and 9/14 =0.64 84
  • 85. Entropy using the frequency table of two attributes 85 ◻Formula for calculating the entropy in splitting process
  • 86. Entropy using two attributes – contd… 86
  • 90. 90 Choose attribute with the largest information gain. Divide the dataset by its branches and repeat the process on every branch.
  • 91. 91 gain(Outlook ) = 0.247 bits gain(Temperature ) = 0.029 bits gain(Humidity ) 0.152 bits gain(Windy ) 0.048 bits Select the attribute with the highest gain ratio Information gain tells us how important the given attribute is. Constructing a decision tree is all about finding attribute that returns the highest information gain.
  • 92. Selecting the best attribute 92
  • 95. 95
  • 96. A branch with entropy of 0 is a leaf node. 96
  • 97. A branch with entropy more than 0 needs further splitting. 97
  • 99. Final Decision Tree 99 ◻The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
  • 100. Decision Tree to Decision Rules 100
  • 101. CART (Classification and Regression Tree) 101 ◻Another decision tree algorithm CART uses the Gini method to create split points using Gini Index (Gini Impurity) and Gini Gain. ◻Define Gini Index: 🞑Gini index or Gini impurity measures the degree or probability of the randomly chosen variable is wrongly classified.
  • 102. Purity and impurity 102 ◻Pure 🞑All the elements belong to a single class. ◻Gini index varies between 0 and 1 🞑0 🡪 denotes that all elements belong to the same class 🞑1 🡪 denotes that the elements are randomly distributed across various classes 🞑0.5 🡪 denotes that the elements are equally distributed across classes
  • 103. Equation of Gini Index 103 ◻Gini index is a metric for classification in CART. ◻It is computed from the sum of the squared probabilities of each class. 🞑Formula used is: Gini = 1 − Σᵢ pᵢ², where pᵢ is the probability of class i.
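A small R sketch of this formula, reproducing the Gini values computed on the following slides (the function name gini is illustrative):

gini <- function(counts) {        # counts = class counts within a node
  p <- counts / sum(counts)
  1 - sum(p^2)                    # Gini = 1 - sum of squared class probabilities
}

gini(c(2, 3))   # Outlook = Sunny (2 Yes, 3 No)     -> 0.48
gini(c(4, 0))   # Outlook = Overcast (4 Yes, 0 No)  -> 0

# Weighted Gini of the Outlook split: ≈ 0.342
(5/14) * gini(c(2, 3)) + (4/14) * gini(c(4, 0)) + (5/14) * gini(c(3, 2))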
  • 104. Taking the same Example: Construct a Decision Tree by using "Gini Index" as a criterion 104
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
  • 105. Outlook 105
Outlook vs Play Golf (Yes / No / Total):
Sunny: 2 / 3 / 5
Overcast: 4 / 0 / 4
Rainy: 3 / 2 / 5
Total: 14
  • 106. Outlook 106
Gini(Outlook=Sunny) = 1 – (2/5)² – (3/5)² = 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)² – (0/4)² = 0
Gini(Outlook=Rain) = 1 – (3/5)² – (2/5)² = 1 – 0.36 – 0.16 = 0.48
Calculate the weighted sum of Gini indexes (Outlook):
Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342
  • 107. Temperature 107
Temperature vs Play Golf (Yes / No / Total):
Hot: 2 / 2 / 4
Cool: 3 / 1 / 4
Mild: 4 / 2 / 6
Total: 14
  • 108. Temperature 108
Gini(Temp=Hot) = 1 – (2/4)² – (2/4)² = 0.5
Gini(Temp=Cool) = 1 – (3/4)² – (1/4)² = 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)² – (2/6)² = 1 – 0.444 – 0.111 = 0.445
Calculate the weighted sum of Gini indexes (Temperature):
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439
  • 109. Humidity 109
Humidity vs Play Golf (Yes / No / Total): High: 3 / 4 / 7, Normal: 6 / 1 / 7, Total: 14
Gini(Humidity=High) = 1 – (3/7)² – (4/7)² = 1 – 0.183 – 0.326 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)² – (1/7)² = 1 – 0.734 – 0.02 = 0.244
Calculate the weighted sum of Gini indexes (Humidity):
Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367
  • 110. Wind 110
Wind vs Play Golf (Yes / No / Total): Weak: 6 / 2 / 8, Strong: 3 / 3 / 6, Total: 14
Gini(Wind=Weak) = 1 – (6/8)² – (2/8)² = 1 – 0.5625 – 0.0625 = 0.375
Gini(Wind=Strong) = 1 – (3/6)² – (3/6)² = 1 – 0.25 – 0.25 = 0.5
Calculate the weighted sum of Gini indexes (Wind):
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428
  • 111. Selecting the attribute 111
Feature / Gini index:
Outlook: 0.342 (minimum value)
Temperature: 0.439
Humidity: 0.367
Wind: 0.428
Therefore Outlook is put at the top of the tree.
  • 112. Outlook is put at the top 112
  • 113. The sub dataset for the overcast outlook contains only "Yes" decisions. This means the overcast branch ends in a leaf node. 113
  • 114. Continuing to split 114 ◻Focus on the sub dataset for the sunny outlook. ◻Find the Gini index scores for the temperature, humidity and wind features respectively.
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes
  • 115. Gini of temperature for sunny outlook 115
Temperature vs Play Golf for sunny outlook (Yes / No / Total): Hot: 0 / 2 / 2, Cool: 1 / 0 / 1, Mild: 1 / 1 / 2, Total: 5
Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)² – (2/2)² = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)² – (0/1)² = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)² – (1/2)² = 1 – 0.25 – 0.25 = 0.5
Calculate the weighted sum of Gini indexes (sunny outlook & temperature):
Gini(Outlook=Sunny and Temp.) = (2/5) x 0 + (1/5) x 0 + (2/5) x 0.5 = 0.2
  • 116. Gini of humidity for sunny outlook 116
Humidity vs Play Golf for sunny outlook (Yes / No / Total): High: 0 / 3 / 3, Normal: 2 / 0 / 2, Total: 5
Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)² – (3/3)² = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)² – (0/2)² = 0
Calculate the weighted sum of Gini indexes (sunny outlook & humidity):
Gini(Outlook=Sunny and Humidity) = (3/5) x 0 + (2/5) x 0 = 0
  • 117. Gini of wind for sunny outlook 117
Wind vs Play Golf for sunny outlook (Yes / No / Total): Weak: 1 / 2 / 3, Strong: 1 / 1 / 2, Total: 5
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)² – (2/3)² = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)² – (1/2)² = 0.5
Calculate the weighted sum of Gini indexes (sunny outlook & wind):
Gini(Outlook=Sunny and Wind) = (3/5) x 0.444 + (2/5) x 0.5 = 0.266 + 0.200 = 0.466
  • 118. Decision for sunny outlook 118
Feature / Gini index:
Temperature: 0.2
Humidity: 0 (minimum value)
Wind: 0.466
Proceed with humidity at the extension of the sunny outlook branch.
  • 119. Decision for sunny outlook 119 Decision is always NO for high humidity and sunny outlook. Decision is always YES for normal humidity and sunny outlook. Now this branch is complete.
  • 120. Decisions for high and normal humidity 120
  • 121. Taking the outlook = rain 121 ◻ Calculate the Gini index scores for the temperature, humidity and wind features when the outlook is rain.
Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
10 Rain Mild Normal Weak Yes
14 Rain Mild High Strong No
  • 122. Gini of temperature for rain outlook 122
Temperature vs Play Golf for rain outlook (Yes / No / Total): Cool: 1 / 1 / 2, Mild: 2 / 1 / 3, Total: 5
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)² – (1/3)² = 0.444
Calculate the weighted sum of Gini indexes (rain outlook & temperature):
Gini(Outlook=Rain and Temp.) = (2/5) x 0.5 + (3/5) x 0.444 = 0.466
  • 123. Gini of humidity for rain outlook 123
Humidity vs Play Golf for rain outlook (Yes / No / Total): High: 1 / 1 / 2, Normal: 2 / 1 / 3, Total: 5
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)² – (1/3)² = 0.444
Calculate the weighted sum of Gini indexes (rain outlook & humidity):
Gini(Outlook=Rain and Humidity) = (2/5) x 0.5 + (3/5) x 0.444 = 0.466
  • 124. Gini of wind for rain outlook 124
Wind vs Play Golf for rain outlook (Yes / No / Total): Weak: 3 / 0 / 3, Strong: 0 / 2 / 2, Total: 5
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)² – (0/3)² = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)² – (2/2)² = 0
Calculate the weighted sum of Gini indexes (rain outlook & wind):
Gini(Outlook=Rain and Wind) = (3/5) x 0 + (2/5) x 0 = 0
  • 125. Decision for rain outlook 125
Feature / Gini index:
Temperature: 0.466
Humidity: 0.466
Wind: 0 (minimum value)
Put the wind feature on the rain outlook branch and monitor the new sub datasets.
  • 126. Rain outlook – contd… 126 Decision is always YES when wind is weak. Decision is always NO when wind is strong. This branch is now complete.
  • 127. Final decision tree built by CART algorithm 127
  • 128. Differences b/w CART & ID3 128 ◻ID3 🞑Attribute selection measure: Entropy and Information Gain 🞑Handles categorical attributes 🞑Used for classification only ◻CART 🞑Attribute selection measure: Gini Index 🞑Handles both categorical and numerical attributes 🞑Used for both classification and regression (hence the name)
  • 129. Evaluating the Decision tree 129 ◻Performance Metrics for Classification Problems ◻Performance Metrics for Regression Problems
  • 130. Performance Metrics for Classification Problems 130 ◻Confusion Matrix ◻Classification Accuracy ◻Classification Report 🞑Precision 🞑Recall 🞑F1 score 🞑Specificity ◻Logarithmic Loss ◻Area under Curve
  • 131. Confusion Matrix 131 ◻It is the easiest way to measure the performance of a classification problem where the output can be of two or more classes. ◻Define Confusion Matrix: 🞑A confusion matrix is a table with two dimensions, viz. "Actual" and "Predicted". 🞑Its cells count the "True Positives (TP)", "True Negatives (TN)", "False Positives (FP)" and "False Negatives (FN)".
  • 132. Parameters used in the Confusion matrix 132 True positives and True negatives 🡪 observations that are correctly predicted (shown in green color). False positives and False negatives 🡪 observations that are wrongly predicted (shown in red color); hence these should be minimized.
  • 133. Explanation for the parameters 133 There are 4 important terms : ◻True Positives : 🞑The cases in which we predicted YES and the actual output was also YES. ◻True Negatives : 🞑The cases in which we predicted NO and the actual output was NO. ◻False Positives : 🞑The cases in which we predicted YES and the actual output was NO. ◻False Negatives : 🞑The cases in which we predicted NO and the actual output was YES.
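In R, a confusion matrix can be built from the actual and predicted labels with the base table() function. The two label vectors below are made-up illustration data, not output of a real model.

actual    <- c("Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No")
predicted <- c("Yes", "No",  "No", "Yes", "No", "Yes", "Yes", "No")

cm <- table(Predicted = predicted, Actual = actual)
print(cm)
#          Actual
# Predicted No Yes
#       No   3   1
#       Yes  1   3

# Read the four terms out of the table
TP <- cm["Yes", "Yes"]; TN <- cm["No", "No"]
FP <- cm["Yes", "No"];  FN <- cm["No", "Yes"]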
  • 134. Classification Accuracy 134 ◻ Classification Accuracy (or simply accuracy) is the ratio of the number of correct predictions to the total number of input samples: Accuracy = (TP + TN) / (TP + TN + FP + FN).
  • 135. Classification Report 135 ◻The classification report consists of the scores of: 🞑Precision 🞑Recall 🞑F1 score 🞑Specificity
  • 136. Precision Mainly used in document retrieval 136 ◻Precision is the fraction of returned documents that are correct (relevant), i.e., Precision = TP / (TP + FP).
  • 137. Recall 137 Recall is the fraction of actual positives that are correctly returned, i.e., Recall = TP / (TP + FN).
  • 138. Specificity 138 ◻Specificity, in contrast to recall, is the fraction of actual negatives that are correctly identified, i.e., Specificity = TN / (TN + FP).
  • 139. F1 Score 139 ◻F1 score is the harmonic mean of precision and recall: F1 = 2 x (Precision x Recall) / (Precision + Recall). 🞑Best value of F1 🡪 1 🞑Worst value of F1 🡪 0
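The scores in the classification report follow directly from the confusion-matrix counts. A sketch in base R; the counts below are illustrative, not taken from a real model.

TP <- 6; TN <- 3; FP <- 2; FN <- 1

accuracy    <- (TP + TN) / (TP + TN + FP + FN)                 # 0.75
precision   <- TP / (TP + FP)                                  # 0.75
recall      <- TP / (TP + FN)                                  # ≈ 0.857 (sensitivity)
specificity <- TN / (TN + FP)                                  # 0.6
f1          <- 2 * precision * recall / (precision + recall)   # 0.8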
  • 141. AUC (Area Under ROC curve) 141 ◻AUC (Area Under Curve) - ROC (Receiver Operating Characteristic) is a performance metric for classification problems, based on varying threshold values. 🞑ROC is a probability curve 🞑AUC measures the degree of separability between the classes ◻The higher the AUC, the better the model.
  • 142. AUC - ROC Curve 142
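One common way to plot the ROC curve and compute the AUC in R is the pROC package; this is an assumption (other packages such as ROCR work too), and the class labels and probabilities below are made up for illustration.

# install.packages("pROC")
library(pROC)

actual <- c(1, 0, 1, 1, 0, 0, 1, 0)                    # true classes
prob   <- c(0.9, 0.4, 0.7, 0.6, 0.2, 0.55, 0.8, 0.3)   # predicted P(class = 1)

roc_obj <- roc(actual, prob)   # ROC curve over all thresholds
auc(roc_obj)                   # area under the curve
plot(roc_obj)                  # draw the ROC curve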
  • 143. LOGLOSS (Logarithmic Loss) Logistic regression loss (or) cross-entropy loss 143 ◻Accuracy counts correct predictions, whereas Log Loss measures the uncertainty of the predicted probabilities. ◻Formula 🞑L(pᵢ) = −log(pᵢ) 🞑pᵢ is the probability attributed to the real (actual) class.
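A direct computation of the log loss from this formula; the probability vector p_true is made-up illustration data.

# p_true = probability the model assigned to the actual class of each record
p_true   <- c(0.9, 0.8, 0.6, 0.95, 0.3)
log_loss <- mean(-log(p_true))   # average of L(p) = -log(p)
log_loss                         # ≈ 0.42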
  • 144. Performance Metrics for Regression Problems 144 ◻The MAE, MSE, RMSE and R-Squared metrics are mainly used to evaluate performance in regression analysis. 🞑Mean Absolute Error (MAE) 🞑Mean Squared Error (MSE) 🞑Root Mean Squared Error (RMSE) 🞑R-Squared (R²)
  • 146. MSE, MAE, RMSE, R² 146 ◻R-squared (Coefficient of determination) 🞑It is a measure of how well the predicted values fit the original values. ■R² values range from 0 to 1 ■Higher value 🡪 better model
  • 147. MAE (Mean Absolute Error) 147 ◻MAE is the average of the absolute differences between the target values and the values predicted by the model.
  • 148. MSE (Mean Squared Error) One of the most preferred metrics for regression tasks 148 ◻It is simply the average of the squared differences between the target values and the values predicted by the regression model.
  • 149. RMSE (Root Mean Squared Error) 149 ◻It is the square root of the MSE.
  • 150. R-squared (Coefficient of determination) 150 ◻It represents how well the predicted values fit the original values. ■R² values range from 0 to 1 ■Higher value 🡪 better model
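The four regression metrics computed directly from their definitions in base R; y and y_hat below are made-up actual and predicted values.

y     <- c(3.0, 5.0, 2.5, 7.0, 4.5)   # actual values
y_hat <- c(2.5, 5.0, 3.0, 8.0, 4.0)   # predicted values

mae  <- mean(abs(y - y_hat))                            # Mean Absolute Error  = 0.5
mse  <- mean((y - y_hat)^2)                             # Mean Squared Error   = 0.35
rmse <- sqrt(mse)                                       # Root MSE             ≈ 0.59
r2   <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)   # R-squared            ≈ 0.86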
  • 151. Decision trees in R 151 ◻A decision tree is a graph that represents choices and their results in the form of a tree. 🞑Nodes of the graph 🡪 represent an event or choice 🞑Edges of the graph 🡪 represent the decision rules or conditions ◻Decision trees are mostly used in Machine Learning and Data Mining applications using R 🞑The R package "party" is used to create decision trees.
  • 152. Commands in R 152 ◻R command to install the package: install.packages("party") ◻To create a decision tree: ctree(formula, data) 🞑formula is a formula describing the predictor and response variables. 🞑data is the name of the data set used.
  • 153. Commands in R – Contd… 153 ◻Input Data 🞑Take the in-built data set named readingSkills to create a decision tree. 🞑Take the variables "age", "shoeSize" and "score" and check whether the given person is a native speaker or not.
# Load the party package. It will automatically load other dependent packages.
library(party)
# Print some records from the data set readingSkills.
print(head(readingSkills))
  • 154. Contd… 154 ◻By executing the above code, it produces the following result and chart.
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
  • 155. ctree() function to create the decision tree and see its graph 155
# Load the party package. It will automatically load other dependent packages.
library(party)
# Create the input data frame.
input.dat <- readingSkills[c(1:105),]
# Give the chart file a name.
png(file = "decision_tree.png")
# Create the tree.
output.tree <- ctree(
  nativeSpeaker ~ age + shoeSize + score,
  data = input.dat)
# Plot the tree.
plot(output.tree)
# Save the file.
dev.off()
  • 156. By executing the above code, it produces the following result 156
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base': as.Date, as.Date.numeric
Loading required package: sandwich
  • 157. Decision tree 157 Conclusion: anyone whose readingSkills score is less than 38.3 and whose age is more than 6 is not a native speaker.
  • 158. Naïve Bayes Algorithm 158 ◻Naive Bayes is one of the powerful machine learning algorithms used for classification. ◻It applies Bayes' theorem with the assumption that every feature is INDEPENDENT of the others.
  • 159. Bayes’ Theorem 159 ◻Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. ◻It is a family of algorithms which all share a common principle: every pair of features being classified is independent of each other.
  • 160. Bayes’ Algorithm – contd… 160 ◻Bayes’ theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). ◻Class conditional independence: 🞑The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors.
  • 161. Formula used in Bayes’ Theorem 161 P(c|x) = P(x|c) · P(c) / P(x) ◻ P(c|x) is the posterior probability of the class (target) given the predictor (attribute). ◻ P(c) is the prior probability of the class. ◻ P(x|c) is the likelihood, i.e., the probability of the predictor given the class. ◻ P(x) is the prior probability of the predictor.
  • 162. How does the Naive Bayes algorithm work? Given dataset: 162
  • 163. Steps: 163 ◻Step 1 🞑The posterior probability is calculated by constructing a frequency table for each attribute against the target. ◻Step 2 🞑Transform the frequency tables into likelihood tables. ◻Step 3 🞑Use the Naive Bayes equation to calculate the posterior probability for each class. 🞑The class with the highest posterior probability is the outcome of the prediction.
  • 164. 164 Step 1 Calculate posterior probability using frequency table
  • 165. 165 Step 2 The likelihood tables for all four predictors
  • 166. 166 Step 3 Here, 4 inputs and 1 target Final posterior probabilities can be standardized between 0 and 1.
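In R, these steps are carried out automatically by the naiveBayes() function from the e1071 package (an assumption here; other packages offer similar functions). The data frame below is the golf dataset from the decision-tree example, typed in by hand, and new_day is an illustrative unseen record.

# install.packages("e1071")
library(e1071)

golf <- data.frame(
  Outlook  = c("Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
               "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"),
  Temp     = c("Hot","Hot","Hot","Mild","Cool","Cool","Cool",
               "Mild","Cool","Mild","Mild","Mild","Hot","Mild"),
  Humidity = c("High","High","High","High","Normal","Normal","Normal",
               "High","Normal","Normal","Normal","High","Normal","High"),
  Wind     = c("Weak","Strong","Weak","Weak","Weak","Strong","Strong",
               "Weak","Weak","Weak","Strong","Strong","Weak","Strong"),
  Decision = c("No","No","Yes","Yes","Yes","No","Yes",
               "No","Yes","Yes","Yes","Yes","Yes","No"),
  stringsAsFactors = TRUE
)

model <- naiveBayes(Decision ~ ., data = golf)   # builds the frequency / likelihood tables

new_day <- data.frame(Outlook = "Sunny", Temp = "Cool",
                      Humidity = "High", Wind = "Strong")
predict(model, new_day)                 # predicted class
predict(model, new_day, type = "raw")   # posterior probabilities for No / Yes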
  • 167. Merits of Naive Bayes algorithm 167 ◻Naive Bayes algorithm merits: 🞑It is an easy and quick way to predict the class of a dataset. ■Hence multi-class prediction is performed easily. 🞑When the assumption of independence holds, Naive Bayes performs better than other algorithms such as logistic regression. 🞑Only a small amount of training data is required.
  • 168. Demerits of Naive Bayes algorithm 168 ◻Assumption of class conditional independence, hence some loss of accuracy. ◻Practically, dependencies exist among variables. 🞑Ex: in hospitals, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.) and diseases (lung cancer, diabetes, etc.) are related. ◻Such dependencies cannot be modeled by the Naive Bayes classifier.