big data analytics unit 2 notes for study

Unit II
BIG DATA
ANALYTICS
Subject Code: CS8091
Regulation : R 2017
1

Text Books and References
3
◻ 1. Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive
Datasets”, Cambridge University Press, 2012.
2. David Loshin, “Big Data Analytics: From Strategic Planning to Enterprise
Integration with Tools, Techniques, NoSQL, and Graph”, Morgan
Kaufmann/El sevier Publishers, 2013.
◻ 1. EMC Education Services, “Data Science and Big Data Analytics:
Discovering, Analyzing, Visualizing and Presenting Data”, Wiley publishers,
2015.
2. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to
Data Science and its Applications”, Wiley Publishers, 2015.
3. Dietmar Jannach and Markus Zanker, “Recommender Systems: An
Introduction”, Cambridge University Press, 2010.
4. Kim H. Pries and Robert Dunnigan, “Big Data Analytics: A Practical
Guide for Managers ” CRC Press, 2015.
5. Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with
MapReduce”, Synthesis Lectures on Human Language Technologies, Vol. 3,
No. 1, Pages 1-177, Morgan Claypool publishers, 2010.

Machine learning
◻ Machine learning is an application of artificial
intelligence (AI) that provides systems to
automatically learn and improve from experience
without being explicitly programmed.
4

Unit II
Clustering and Classification
Classification and clustering are two methods of
pattern identification used in machine learning.
8

Example
◻Assume two clusters Mammal &
🡪
Reptile.
🞑Mammal cluster includes human,
🡪
leopards, elephant, etc.
🞑Reptile cluster includes snakes, lizard,
🡪
komodo dragon etc.
◻Some algorithms used for clustering are
k-means clustering algorithm, Fuzzy c-
means clustering algorithm, Gaussian
(EM) clustering algorithm etc.
12

Example
◻The data points in the graph are
clustered together into 3 clusters.
13

Contd…
◻ It is not necessary for clusters to be a spherical.
14

What are the Uses of
Clustering?
◻Clustering has a variety of uses in a
variety of industries.
🞑Market segmentation
🞑Social network analysis
🞑Search result grouping
🞑Medical imaging
🞑Image segmentation
🞑Anamoly detection
15

Contd…
◻After clustering, each cluster is assigned
a number called cluster ID.
16

Types of clustering
◻Hard clustering: Grouping the data items
such that each item is in only one cluster.
◻Soft (or) overlapping Clustering :
Grouping the data items such that data
items can exist in multiple clusters.
17

Cluster Formation Methods
(or) Clustering Methods
◻Density-based Clustering
◻Distribution-based Clustering
◻Partitioning Methods
◻Hierarchical Clustering
18

Density-based Clustering
◻ Similar dense areas are clustered.
◻ Have good accuracy and ability to merge two
clusters.
🞑Example
■DBSCAN (Density-Based Spatial Clustering of
Applications with Noise)
■OPTICS (Ordering Points to Identify Clustering
Structure) etc.
19

Contd…
• These clusters take any arbitrary shapes.
20

Partitioning Methods
◻Partition the objects into k clusters and
each partition forms one cluster.
🞑Example:
■K-means & CLARANS (Clustering Large
Applications based upon Randomized Search)
21

K means clustering
It follow Centroid-based Clustering
◻Organizes the data into non-hierarchical
clusters
🞑Example: k-means clustering algorithm.
22

Distribution-based Clustering
◻This clustering approach assumes data is
composed of distributions, such as
Gaussian distributions.
23

Hierarchical Clustering
◻Creates a tree of clusters. Construct
clusters as a tree-type structure based on
the hierarchy.
◻Two categories
🞑Agglomerative (Bottom-up approach)
🞑Divisive (Top-down approach)
◻Example:
🞑Clustering using Representatives (CURE),
🞑Balanced iterative Reducing Clustering using
Hierarchies (BIRCH), etc.
24

Working of Hierarchical clustering
26

Grid-Based method
◻The data space is formulated into a finite
number of cells that form a grid-like
structure
🞑Example
■Statistical Information Grid (STING)
■Clustering in Quest (CLIQUE).
27

K-means clustering algorithm
◻K-Means Clustering is an unsupervised learning
algorithm that is used to solve the clustering
problems in machine learning or data science.
28

What is K-Means Algorithm?
◻K-Means Clustering is an Unsupervised
Learning algorithm, groups unlabeled
🡪
dataset into different clusters.
🞑K Number of pre-defined clusters
🡪
🞑If K=2 Two clusters
🡪
🞑If K=3 Three clusters
🡪
Definition:
It is an iterative algorithm that divides the unlabeled
dataset into k different clusters in such a way each
dataset belongs only one group having similar
properties
29

K-Means Clustering Algorithm
K-Means Algorithm involves following steps:
◻ Step-01:
🞑 Choose the number of clusters K.
◻ Step-02:
🞑 Randomly select any K data points as cluster
centers(centroids).
🞑 Select cluster centers in such a way that they are as farther
as possible from each other.
◻ Step-03:
🞑 Calculate the distance between each data point and each
cluster center.
🞑 The distance may be calculated either by using given
distance function or by using euclidean distance formula.
◻ Step-04:
🞑 Assign each data point to some cluster.
🞑 A data point is assigned to that cluster whose center is
nearest to that data point.
30

Contd…
◻ Step-05:
🞑Re-compute the center of newly formed
clusters.
🞑The center of a cluster is computed by taking
mean of all the data points contained in that
cluster.
◻ Step-06:
🞑Keep repeating the procedure from Steps 3
to 5 until any of the following stopping
criteria is met:
■Center of newly formed clusters do not change
■Data points remain present in the same cluster
■Maximum number of iterations are reached
31

Working of K means
algorithm
Given data points is
pictured as shown.
Step 1:
Let's take number k of clusters, i.e.,
K=2. Choose some random k points
or centroid to form the cluster.
Step 2:
32

Contd…
By applying mathematical
formulas, calculate the distance
between two points. we draw the
median between both the
centroids.
Step 3:
Left side points is near to the K1
or blue centroid and right side
points points are close to the
yellow centroid
Step 4:
33

Contd…
Repeat the process by choosing
a new centroid. Its computed
by calculating the median again.
Step 5:
Reassign each datapoint to
the new centroid.
Step 6:
34

Contd…
we can see, one yellow point is on
the left side of the line, and two blue
points are right to the line. Three
points are assigned new centroids.
Step 7:
Repeat the process by
finding the new centroids
Step 8:
35

Contd…
No dissimilar data points on either
side of the line
Step 9:
Final two clusters is shown
Step 10:
36

Merits & Demerits of K-means
Advantages-
◻ K-Means Clustering Algorithm offers the following advantages-
◻ Point-01:
🞑 It is relatively efficient with time complexity O(nkt) where
🞑 n = number of instances
🞑 k = number of clusters
🞑 t = number of iterations
◻ Point-02:
🞑 It often terminates at local optimum.
🞑 Techniques such as Simulated Annealing or Genetic Algorithms
may be used to find the global optimum.
Disadvantages-
◻ K-Means Clustering Algorithm has the following disadvantages-
◻ It requires to specify the number of clusters (k) in advance.
◻ It can not handle noisy data and outliers.
◻ It is not suitable to identify clusters with non-convex shapes.
38

Determining the number of
clusters:
39
◻Elbow method
◻Average silhouette method
◻Gap statistic method

Elbow method
40
◻The Elbow method
🞑Find Total WSS(within-cluster sum of
square)
🞑WSS Number of clusters
🡪

Elbow method - Steps
41
◻Step 1
🞑Compute clustering algorithm (e.g., k-means
clustering) for different values of k.
■ For instance, by varying k from 1 to 10 clusters.
◻Step 2
🞑For each k, calculate the total within-cluster sum
of square (wss).
◻Step 3
🞑Plot the curve of wss according to the number of
clusters k.
◻Step 4
🞑The location of a bend (knee) in the plot is an
indicator of the appropriate number of clusters.

Average silhouette method
42
◻Compute the average silhouette of
observations for different values of k.
🞑Optimal number of clusters k is the one
that maximize the average silhouette

Average silhouette - Steps
43
◻Step 1
🞑Compute clustering algorithm (e.g., k-means
clustering) for different values of k.
■For instance, by varying k from 1 to 10 clusters.
◻Step 2
🞑For each k, calculate the average silhouette of
observations (avg.sil).
◻Step 3
🞑Plot the curve of avg.sil according to the
number of clusters k.
◻Step 4
🞑The location of the maximum is considered as
the appropriate number of clusters.

Gap statistic method
44
◻Compares the total within intra-cluster
variation for different values of k with
their expected values under null
reference distribution of the data.
◻Optimal cluster k is which maximize the
gap statistic (i.e, that yields the largest
gap statistic).

Gap statistic - steps
45
◻ Step 1
🞑 Cluster the data, varying the k = 1, …, kmax, and compute the
corresponding total within intra-cluster variation Wk.
◻ Step 2
🞑 Generate reference data sets with a random uniform distribution.
Cluster each of these reference data sets with varying number of
clusters k = 1, …, kmax, and compute the corresponding total
within intra-cluster variation Wkb.
◻ Step 3
🞑 Compute the estimated gap statistic as the deviation of the
observed Wk value from its expected value Wkb under the null
hypothesis: Gap(k)=1B∑b=1Blog(W∗kb)
−log(Wk)Gap(k)=1B∑b=1Blog(Wkb )−log(Wk) . Compute also
∗
the standard deviation of the statistics.
◻ Step 4
🞑 Choose the number of clusters as the smallest value of k such
that the gap statistic is within one standard deviation of the gap
at k+1: Gap(k)≥Gap(k + 1)−sk + 1.

Comparison – Different clusters
46

Where I can apply k means?
◻Document Classification
◻Delivery Store Optimization
◻Identifying Crime Localities
◻Customer Segmentation
◻Fantasy League Stat Analysis
◻Insurance Fraud Detection
◻Rideshare Data Analysis
◻Cyber-Profiling Criminals
47

Use case 1
◻Categorizing documents based on tags,
topics, and the content of the document is
very difficult.
◻k-means algorithm is very much suitable
algorithm for this purpose. Based on term
frequency, the document vectors are
clustered to help identify similarity in
document groups.
Document Classification
48

Use case 2
◻Optimized path to solve the truck route
which helps the traveling salesman
problem.
Delivery Store Optimization
49

Use case 3
◻Using the category of crime, the area of
the crime and their association helps to
identify crime-prone areas within a city
Identifying Crime Localities
50

Use case 4
◻Clustering helps marketers to segment
customers based on purchase history,
interests, or activity monitoring.
🞑For example, telecom providers can cluster
pre-paid customers by identifying the money
spent in recharging, sending SMS, and
browsing the internet.
🞑It helps the company target specific clusters
of customers for specific campaigns
Customer Segmentation
51

Use case 5
◻Analyzing player stats has always been a
critical element of the sporting world, with
increasing competition.
◻Machine learning plays a major role.
Fantasy League Stat Analysis
52

Use case 6
◻Machine learning plays a critical role fraud
detection and has numerous applications
in automobile, healthcare, and insurance
fraud detection.
🞑Based on past historical data on fraudulent
claims, fraudulent patterns can be easily
identified.
Insurance Fraud Detection
53

Use case 7
◻Using ride information dataset about
traffic, transit time, peak pickup localities
helps the call taxi drivers like uber helps
to plan the cities for the future.
Rideshare Data Analysis
54

Use case 8
◻Cyber-profiling collects data from
🡪
individuals and groups to identify
significant co-relations.
◻These information on the investigation
division to classify the criminals based on
the their types.
Cyber-Profiling Criminals
55

Use case 9
◻Call detail record (CDR) is the information
captured by telecom companies during the
call, SMS, and internet activity of a
customer.
🞑These information helps to know about the
customer’s needs and their usage details.
Call Record Detail Analysis
56

Use case 10
◻Large IT infrastructure technology
components such as network, storage, or
database generate large volumes of alert
messages.
🞑These alert messages potentially point to
operational issues for processing.
🞑Clustering helps to categorize the alerts
🡪
Automatic Clustering of IT Alerts
57

Clustering and Classification
58

Differences between
Classification and Clustering
59

Classification
◻Classification is the process of learning a
model that categorizes different classes of
data.
◻It’s a two-step process:
◻Learning step
🞑The learning step can be accomplished by
using an already defined training set of data.
◻Prediction step
🞑Predict or classify based on the above
response.
61

Algorithms used for classification
◻Logistic Regression
◻Decision Tree
◻Naive Bayes classifier
◻Support Vector Machines(SVM)
◻Random Forest
62

Decision tree
Decision Tree is a Supervised learning technique
that can be used for both classification and
Regression problems.
63

Definition : Decision tree
◻A decision tree is a tree where
🞑Each node represents a feature(attribute)
🞑Each branch represents a decision(rule)
🞑Each leaf represents an outcome
(categorical or continues value).
64

Idea behind decision tree
◻ Decision Trees usually mimic human like thinking while
making a decision, which is easy to understand.
◻ Create a tree like structure, hence its easily
understood.
◻ Decision trees can handle both categorical and
numerical data.
65

Terminology related to
Decision Trees
◻Root Node:
🞑It represents the entire population or sample or
datasets.
◻Splitting:
🞑It is a process of dividing a node into two or more
sub-nodes.
◻Decision Node:
🞑When a sub-node splits into further sub-nodes,
then it is called the decision node.
◻Leaf / Terminal Node:
🞑Nodes do not split is called Leaf or Terminal node.
66

Terminology contd…
◻Pruning:
🞑It’s the process of removing the unwanted
branches from the tree. Its just opposite of
splitting.
◻Parent/Child node:
🞑The root node of the tree is called the parent
node, and other nodes are called the child
nodes.
67

Decision tree learning Algorithms
69
◻ID3 (Iterative Dichotomiser 3)
🞑ID3 uses Entropy and information Gain to
construct a decision tree
◻C4.5 (successor of ID3)
◻CART (Classification And Regression Tree)
◻CHAID (CHi-squared Automatic Interaction
Detector).
🞑Performs multi-level splits when computing
classification trees)
◻MARS: extends decision trees to handle
numerical data better.

Working of Decision tree algorithm
◻ Step-1:
🞑 Begin the tree with the root node, says S, which contains the
complete dataset.
◻ Step-2:
🞑 Find the best attribute in the dataset using Attribute
Selection Measure (ASM).
◻ Step-3:
🞑 Divide the S into subsets that contains possible values for the
best attributes.
◻ Step-4:
🞑 Generate the decision tree node, which contains the best
attribute.
◻ Step-5:
🞑 Recursively make new decision trees using the subsets of the
dataset created in step -3.
🞑 Continue this process until a stage is reached where you
cannot further classify the nodes and called the final node as a
leaf node.
70

ID3
72
◻Core algorithm for building decision
trees is called ID3.
🞑Developed by J. R. Quinlan,
🞑Employs a top-down, greedy search
🞑Uses Entropy and Information Gain to
construct a decision tree.

Attribute Selection Measures
◻Main issue is how to select the best
attribute for the root node and for sub-
nodes.
◻This is done by a technique called as
Attribute selection measure or ASM.
🞑Information Gain
🞑Gini Index
🞑Gain Ratio
73

Information Gain
◻To understand information gain we should
know Entropy.
◻Define Entropy:
🞑Randomness or uncertainty of a random variable
X is defined by Entropy.
◻For a binary classification problem with only
two classes, positive and negative class.
🞑If all examples are all positive or all are
negative Entropy will be
🡪 zero i.e, low.
🞑If half of the records are of positive class and half
are of negative class Entropy is
🡪 one i.e, high.
74

Contd…
◻ By calculating entropy measure of each attribute
we can calculate their information gain.
◻ Information Gain calculates the expected reduction
in entropy due to sorting on the attribute.
76

Example:
Construct a Decision Tree by using “information gain” as a criterion
78
Given dataset:

Criterion for attribute selection
81
◻Which is the best attribute?
🞑The one which will result in the smallest
tree
🞑choose the attribute that produces the
“purest” nodes
■Purity improves attribute selection
■Information gain with purity in subsets

Steps to calculate information
gain:
◻Step 1:
🞑Calculate entropy of Target.
◻Step 2:
🞑Entropy for every attribute needs to be
calculated.
🞑Calculate Information Gain using the
following formula.
Information Gain = Target Entropy – (Entropy of all attributes)
82

Contd…
◻To build a decision tree, calculate two
types of entropy using frequency tables
as follows:
🞑Entropy using the frequency table of one
attribute
🞑Entropy using the frequency table of two
attributes
83

Entropy using the frequency
table of one attribute
Note : Use Golf dataset in slideno:77
9 + 5 = 14
5/14 = 0.36 and 9/14 =0.64
84

Entropy using the frequency
table of two attributes
85
◻Formula for calculating the entropy in
splitting process

Entropy using two attributes – contd…
86

90
Choose attribute with the largest information gain.
Divide the dataset by its branches and repeat the process on every branch.

91
gain(Outlook ) = 0.247 bits
gain(Temperature ) = 0.029 bits
gain(Humidity ) 0.152 bits
gain(Windy ) 0.048 bits
Select the attribute with the highest gain ratio
Information gain tells us how important the given
attribute is.
Constructing a decision tree is all about finding
attribute that returns the highest information gain.

Selecting the best attribute
92

A branch with entropy of 0 is a
leaf node.
96

A branch with entropy more
than 0 needs further splitting.
97

Final Decision Tree
99
◻The ID3 algorithm is run recursively on the
non-leaf branches, until all data is classified.

Decision Tree to Decision Rules
100

CART (Classification and
Regression Tree)
101
◻Another decision tree algorithm CART
uses the Gini method to create split
points using Gini Index (Gini Impurity)
and Gini Gain.
◻Define Gini Index:
🞑Gini index or Gini impurity measures the
degree or probability of the randomly
chosen variable is wrongly classified.

Purity and impurity
102
◻Pure
🞑All the elements belong to a single class.
◻Gini index varies between 0 and 1
🞑0 denotes all elements belong to a same
🡪
class
🞑1 denotes that the elements are randomly
🡪
distributed across various classes.
🞑0.5 denotes elements are equally
🡪
distributed.

Equation of Gini Index
103
◻Gini index is a metric for classification in
CART.
◻It stores sum of squared probabilities of
each class.
🞑Formula used is:

Taking the same Example:
Construct a Decision Tree by using “Gini Index” as a criterion
104 Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Outlook
105
Play Golf
Yes No
Outlook
Sunny 2 3 5
Overcast 4 0 4
Rainy 3 2 5
14

Outlook
106
Gini(Outlook=Sunny) = 1 – (2/5)2
– (3/5)2
= 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)2
– (0/4)2
= 0
Gini(Outlook=Rain) = 1 – (3/5)2
– (2/5)2
= 1 – 0.36 – 0.16 = 0.48
Calculate weighted sum of gini indexes (outlook)
Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48
= 0.171 + 0 + 0.171 = 0.342

Temperature
107
Play Golf
Yes No
Temperature
Hot 2 2 4
Cool 3 1 4
Mild 4 2 6
14

Temperature
108
Gini(Temp=Hot) = 1 – (2/4)2
– (2/4)2
= 0.5
Gini(Temp=Cool) = 1 – (3/4)2
– (1/4)2
= 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)2
– (2/6)2
= 1 – 0.444 – 0.111 = 0.445
Calculate weighted sum of gini indexes (Temperature)
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445
= 0.142 + 0.107 + 0.190 = 0.439

Humidity
109
Play Golf
Yes No
Humidity
High 3 4 7
Normal 6 1 7
14
Gini(Humidity=High) = 1 – (3/7)2
– (4/7)2
= 1 – 0.183 – 0.326 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)2
– (1/7)2
= 1 – 0.734 – 0.02 = 0.244
Calculate weighted sum of gini indexes (Humidity)
Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367

Wind
110
Play Golf
Yes No
Wind
Weak 6 2 8
Strong 3 3 6
14
Gini(Wind=Weak) = 1 – (6/8)2
– (2/8)2
= 1 – 0.5625 – 0.062 = 0.375
Gini(Wind=Strong) = 1 – (3/6)2
– (3/6)2
= 1 – 0.25 – 0.25 = 0.5
Calculate weighted sum of gini indexes (Humidity)
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428

Selecting the attribute
111
Feature Gini index
Outlook 0.342
Temperature 0.439
Humidity 0.367
Wind 0.428
Minimum
value
Therefore outlook is put at the top of the tree.

Sub dataset in the overcast has only “yes”.
This means that overcast leaf is over.
113

Continuing to split
114
◻Focus on the sub dataset for sunny
outlook.
◻Find the gini index scores for
temperature, humidity and wind features
respectively.
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes

Gini of temperature for
sunny outlook
115
Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)2
– (2/2)2
= 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)2
– (0/1)2
= 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)2
– (1/2)2
= 1 – 0.25 – 0.25 = 0.5
Calculate weighted sum of gini indexes (outlook & temperature)
Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5
= 0.2
Play Golf
Yes No
Temperature
Hot 0 2 2
Cool 1 0 1
Mild 1 1 2
5

Gini of humidity for sunny
outlook
116
Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)2
– (3/3)2
= 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)2
– (0/2)2
= 0
Calculate weighted sum of gini indexes (outlook & humidity)
Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0
Play Golf
Yes No
Humidity
High 0 3 3
Normal 2 0 2
5

Gini of wind for sunny outlook
117
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)2
– (2/3)2
= 0.266
Gini(Outlook=Sunny and Wind=Strong) = 1- (1/2)2
– (1/2)2
= 0.2
Calculate weighted sum of gini indexes (outlook & wind)
Gini(Outlook=Sunny and Wind) = (3/5)x0.266 + (2/5)x0.2
= 0.466
Play Golf
Yes No
Wind
Weak 1 2 3
Strong 1 1 2
5

Decision for sunny outlook
118
Feature Gini index
Temperature 0.2
Humidity 0
Wind 0.466
Minimum
value
Proceed with humidity at the extension of sunny outlook.

Decision for sunny outlook
119
Decision is always NO for high humidity and sunny outlook.
Decision is always be YES for normal humidity and sunny outlook.
Now this branch gets completed.

Decisions for high and normal
humidity
120

Taking the outlook = rain
121
◻ Calculate gini index scores for temperature,
humidity and wind features when outlook is rain.
Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
10 Rain Mild Normal Weak Yes
14 Rain Mild High Strong No

Gini of temprature for rain
outlook
122
Play Golf
Yes No
Temperature
Cool 1 1 2
Mild 2 1 3
5
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)2
– (1/2)2
= 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)2
– (1/3)2
= 0.444
Calculate weighted sum of gini indexes (outlook & temperature)
Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444
= 0.466

Gini of humidity for rain
outlook
123
Play Golf
Yes No
Humidity
High 1 1 2
Normal 2 1 3
5
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)2
– (1/2)2
= 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)2
– (1/3)2
= 0.444
Calculate weighted sum of gini indexes (outlook & humidity)
Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444
= 0.466

Gini of wind for rain outlook
124
Play Golf
Yes No
Wind
Weak 3 0 3
Strong 0 2 2
5
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)2
– (0/3)2
= 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)2
– (2/2)2
= 0
Calculate weighted sum of gini indexes (rain outlook & wind)
Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0

Decision for rain outlook
125
Feature Gini index
Temperature 0.466
Humidity 0.466
Wind 0 Minimum
value
Put the wind feature for rain outlook branch and monitor
the new sub data sets.

Rain outlook – contd…
126
Decision is always YES when wind is weak.
Decision is always NO if wind is strong.
Branch
is over

Final decision tree built by
CART algorithm
127

Differences b/w CART & ID3
128

Evaluating the Decision tree
129
◻Performance Metrics for Classification
Problems
◻Performance Metrics for Regression
Problems

Performance Metrics for
Classification Problems
130
◻Confusion Matrix
◻Classification Accuracy
◻Classification Report
🞑Precision
🞑Recall
🞑F1 score
🞑Specificity
◻Logarithmic Loss
◻Area under Curve

Confusion Matrix
131
◻It is the easiest way to measure the
performance of a classification problem
where the output can be of two or more
type of classes.
◻Define Confusion Matrix:
🞑A confusion matrix is nothing but a table with
two dimensions viz. “Actual” and “Predicted”
🞑The dimensions have “True Positives (TP)”,
“True Negatives (TN)”, “False Positives (FP)”,
“False Negatives (FN)” .

Parameters used in
Confusion matrix
132
True positive and True negatives observations that are
🡪
correctly predicted and are shown in green color.
False positive and False negatives observations that are
🡪
wrongly predicted and are shown in red color. Hence it should be
minimized.

Explanation for the
parameters
133
There are 4 important terms :
◻True Positives :
🞑The cases in which we predicted YES and the
actual output was also YES.
◻True Negatives :
🞑The cases in which we predicted NO and the
actual output was NO.
◻False Positives :
🞑The cases in which we predicted YES and the
actual output was NO.
◻False Negatives :
🞑The cases in which we predicted NO and the
actual output was YES.

Classification Accuracy
134
◻ Classification Accuracy (or) accuracy is the
ratio of number of correct predictions to the
total number of input samples.

Classification Report
135
◻The classification report consists of the
scores of:
🞑Precision
🞑Recall
🞑F1 score
🞑Specificity

Precision
Mainly used in document retrievals
136
◻Precision is defined as the number of
correct documents returned.

Recall
137
Recall is defined as the number of
positives returned.

Specificity
138
◻Specificity, in contrast to recall, is
defined as the number of negatives
returned.

F1 Score
139
◻F1 score is the weighted average of
the precision and recall.
🞑Best value of F1 1
🡪
🞑Worst value of F1 0
🡪

AUC (Area Under ROC curve)
141
◻AUC (Area Under Curve) - ROC
(Receiver Operating Characteristic) is a
performance metric, based on varying
threshold values, for classification
problems.
🞑ROC is a probability curve
🞑AUC measure the separability.
◻Higher the AUC, better the model.

LOGLOSS (Logarithmic Loss)
Logistic regression loss (or) cross-entropy loss
143
◻Accuracy is the count of predictions
whereas Log Loss is the amount of
uncertainty of the prediction.
◻Formula
🞑L(pi)=−log(pi)
🞑 p is the probability attributed to the real
class.

Performance Metrics for
Regression Problems
144
◻The MSE, MAE, RMSE, and R-Squared
metrics are mainly used to evaluate the
performance in regression analysis.
🞑Mean Absolute Error (MAE)
🞑Mean Square Error (MSE)
🞑R Squared (R2
)

MSE, MAE, RMSE, R2
146
◻R-squared (Coefficient of determination)
🞑Its the coefficient of how well the values fit
compared to the original values.
■R value range from 0 to 1
■Higher value Model is good.
🡪

MAE (Mean absolute error)
147
◻MAE is the absolute difference between
the target value and the value predicted
by the model.

MSE (Mean Squared Error)
Most preferred metrics for regression tasks
148
◻It is simply the average of the squared
difference between the target value and
the value predicted by the regression
model.

RMSE (Root Mean Squared Error)
149
◻It is the error rate by the square root of MSE.

R-squared (Coefficient of
determination)
150
◻It represents the coefficient of how well
the values fit compared to the original
values.
■R value range from 0 to 1
■Higher value Model is good.
🡪

Decision trees in R
151
◻Decision tree is a graph to represent
choices and their results in form of a tree.
🞑Nodes in the graph represent an event or
🡪
choice
🞑Edges of the graph represent the decision
🡪
rules or conditions.
◻Decision trees mostly used in Machine
🡪
Learning and Data Mining applications using
R
🞑 The R package "party" is used to create decision trees.

Commands in R
152
◻R command to install the package
◻To creating a decision tree
🞑formula is a formula describing the
predictor and response variables.
🞑data is the name of the data set used.
install.packages("party")
ctree(formula, data)

Commands in R – Contd…
153
◻Input Data
🞑Take in-built data set named readingSkills to
create a decision tree.
🞑 Taking the variables "age",“ shoesize",
"score" and check whether the given person is
a native speaker or not.
# Load the party package.
# It will automatically load other dependent packages.
library(party)
# Print some records from data set readingSkills.
print(head(readingSkills))

Contd…
154
◻By executing the above code, it
produces the following result and chart.
nativeSpeaker age shoeSize score
1 yes 5 24.83189 32.29385
2 yes 6 25.95238 36.63105
3 no 11 30.42170 49.60593
4 yes 7 28.66450 40.28456
5 yes 11 31.88207 55.46085
6 yes 10 30.07843 52.83124
Loading required package: methods
Loading required package: grid
...............................
...............................

ctree() function to create the
decision tree and see its graph
155
# Load the party package. It will automatically load other
# dependent packages.
library(party)
# Create the input data frame.
input.dat <- readingSkills[c(1:105),]
# Give the chart file a name.
png(file = "decision_tree.png")
# Create the tree.
output.tree <- ctree(
nativeSpeaker ~ age + shoeSize + score,
data = input.dat)
# Plot the tree.
plot(output.tree)
# Save the file.
dev.off()

By executing the above code,
it produces the following result
156
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
Loading required package: sandwich

Decision tree
157
Conclusion:
ReadingSkills score is less than 38.3 and age is more than 6 is not a
native Speaker.

Naïve Bayes Algorithm
158
◻Naive Bayes is one of the powerful
machine learning algorithms that is used
for classification.
◻It is an extension of the Bayes theorem
wherein each feature assumes
INDEPENDENCE.

Bayes’ Theorem
159
◻Naive Bayes classifiers is a collection of
classification algorithms based on
Bayes’ Theorem.
Baye’s Theorem:
It is a family of algorithms where all of them share
a common principle, i.e. every pair of features
being classified is independent of each other.

Baye’s Algorithm – contd…
160
◻Bayes theorem provides a way of
calculating the posterior probability,
P(c|x), from P(c), P(x), and P(x|c).
◻Class conditional independence.
🞑Naive Bayes classifier assume that the
effect of the value of a predictor (x) on a
given class (c) is independent of the values
of other predictors.

Formula used in Baye’s
161
◻ P(c|x) is the posterior probability of class (target) given predictor
(attribute).
◻ P(c) is the prior probability of class.
◻ P(x|c) is the likelihood which is the probability of predictor given
class.
◻ P(x) is the prior probability of predictor.

How Naive Bayes algorithm works?
Given dataset:
162

Steps:
163
◻Step 1
🞑The posterior probability is calculated
constructing a frequency table for each
attribute against the target.
◻Step 2
🞑Transform the frequency tables to likelihood
tables
◻Step 3
🞑Use the Naive Bayesian equation to calculate
the posterior probability for each class.
🞑Class with the highest posterior probability is
the outcome of prediction.

164
Step 1
Calculate posterior probability using frequency table

165
Step 2
The likelihood tables for all four predictors

166
Step 3
Here, 4 inputs and 1 target
Final posterior probabilities can be standardized between 0 and 1.

Merits of Naive Bayes algorithm
167
◻Naive Bayes algorithm merits:
🞑Easy and quick way to predict the class of
the dataset.
■Hence multi-class prediction is performed
easily.
🞑When the assumption of independence is
valid, Naive Bayes is much more capable
than the other algorithms like logistic
regression.
🞑Only less training data is required

Demerits of Naive Bayes
algorithm
168
◻Assumption: class conditional
independence, hence less accuracy
◻Practically , dependencies exist among
variables.
🞑Ex : hospitals, patients, profile, age, family
history etc.
🞑Symptoms: fever, cough etc. Disease, lung
cancer, diabetes etc.
◻Dependencies among these cannot be
modeled by Naive Bayes classifier.

big data analytics unit 2 notes for study

More Related Content

Similar to big data analytics unit 2 notes for study (20)

Recently uploaded (20)

big data analytics unit 2 notes for study