Dr. Rahul J. Pandya,
Assistant Professor,
Electrical, Electronics, and Communication Engineering (EECE) Dept.,
Indian Institute of Technology (IIT), Dharwad
Email: rpandya@iitdh.ac.in
Introduction to Machine Learning
Course content - Syllabus
▪ Introduction to Machine Learning (ML)
▪ Types of Machine learning
▪ Supervised ML
▪ Unsupervised ML
▪ Semi-Supervised ML
▪ Reinforcement Learning (RL)
▪ Machine learning (ML) algorithms
▪ Regression – Linear Regression, Logistic Regression, Multivariate Regression
▪ Classification
▪ Clustering – Partitional clustering, Hierarchical clustering, Density-based clustering
▪ Decision trees
▪ K-Nearest Neighbours (KNN)
▪ Kernel methods: Support vector machine
▪ Reinforcement Learning (RL) algorithms
▪ Graphical models: Gaussian mixture models and hidden Markov models
▪ Introduction to the Bayesian Approach: Bayesian classification, Bayesian learning,
Bayes optimal classifier, and Naïve Bayes Classifier.
Reference books
▪ C. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006
▪ K. P. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press,
2012.
Introduction to Machine Learning (ML)
Artificial Intelligence (AI)
Enables systems to perform intelligent tasks through a set of rules.
Machine Learning (ML)
A process of learning from data without complex hand-written rules; it involves training a model on datasets and predicting the outcome.
Deep Learning (DL)
ML at a large scale, equipped with artificial neural networks.
https://www.geeksforgeeks.org/difference-between-artificial-intelligence-vs-machine-learning-vs-deep-learning/
Introduction to Machine Learning (ML)
▪ Artificial Intelligence (AI):
Approaches that enable computers
to perform intelligent tasks.
▪ Machine Learning (ML): Approaches
that learn the underlying patterns in a
given set of features without being
explicitly programmed.
▪ Deep Learning (DL): Approaches
that learn the underlying
representations and patterns in a
given set of raw data without being
explicitly programmed.
Artificial Intelligence
▪ Intelligence: the ability to learn and understand from experience, and to use it to decide a
future course of action
▪ Artificial Intelligence (AI): enabling machines to do so-called intelligent tasks
▪ Problem solving
▪ Discovery
▪ Learning
▪ Dealing with uncertainties
▪ AI categories:
▪ Problem solving using search methods
▪ State space search, heuristic search, randomized search, rule-based search
▪ Symbolic manipulation is one form of AI
▪ The connectionist approach is another form of AI
Machine Learning
▪ With more and more digital data available, the task is the automatic discovery and
learning of patterns from both natural and synthetic data.
▪ Less focus on feature extraction; signal-processing knowledge is not a
prerequisite!
▪ More emphasis on discovery and learning of patterns by the machine.
▪ Ability to learn by extracting patterns from data (features).
▪ Pattern learning is treated more like learning an associated function.
▪ Output y = f(x), where y is the output and x is the input data (features).
▪ The goal of ML is to learn the f() that maps x to y.
Ref: https://www.javatpoint.com/reinforcement-learning
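As a minimal illustration of learning f() from (x, y) pairs, here is a sketch in Python (the toy line y = 3x + 1 and the NumPy least-squares fit are assumptions for the example, not from the slides):

```python
import numpy as np

# Toy data: inputs x (features) and outputs y produced by an unknown f plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 3.0 * x + 1.0 + 0.1 * rng.standard_normal(50)

# "Learning" here = fitting y ≈ a·x + b by least squares
a, b = np.polyfit(x, y, deg=1)
print(f"learned f(x) ≈ {a:.2f}·x + {b:.2f}")  # close to the true 3x + 1
```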
Deep Learning
▪ Task of learning both features (representation) and also patterns for pattern
recognition.
▪ Trying to mimic human way of learning.
▪ Learning from experience
▪ Need not specify everything in the beginning
▪ Understand in terms of hierarchy of concepts
▪ Each concept is defined in terms of its relation to simpler concepts
▪ Learning complicated concepts out of simpler ones
Ref: https://www.javatpoint.com/reinforcement-learning
What is Machine Learning (ML)?
Introduction to Machine Learning (ML)
▪ Machine learning gives “computers the ability to learn without being explicitly programmed.”
~ Arthur Samuel
Ref: https://pub.towardsai.net/machine-learning-algorithms-for-beginners-with-python-code-examples-ml-19c6afd60daa
https://prutor.ai/ml-what-is-machine-learning/
▪ Preparing for the exams
▪ Students feed their machine
(brain) with a good amount of
high-quality data (questions and
answers from different books,
teachers' notes, or online video
lectures).
▪ They train their brain with input as
well as output, i.e., what kind of
approach or logic they have to use to
solve different kinds of questions.
▪ Similarly, in ML we train the machine with
data (both inputs and outputs are
given to the model), and when the
time comes we test it on data (with
input only); our model achieves a
score by comparing its answers with
the actual outputs, which were not
fed during training.
How Does ML Work?
▪ Features of ML:
▪ Machine learning uses data to detect various patterns in a given dataset.
▪ It can learn from past data and improve automatically.
▪ It is a data-driven technology.
▪ Machine learning is like data mining, as it also deals with vast amounts of data.
Ref: https://www.javatpoint.com/machine-learning
▪ A machine learning system learns from historical data, builds prediction
models, and whenever it receives new data, predicts the output for it.
Why Machine Learning (ML)?
▪ Machine learning gives “computers the ability to learn without being explicitly
programmed.” ~ Arthur Samuel
▪ Why ML?
▪ Machine learning models help us in many tasks, such as:
▪ Object recognition
▪ Summarization
▪ Prediction
▪ Classification
▪ Clustering
▪ Recommender systems
▪ And others
▪ ML is a scientific branch of AI
▪ Deep learning is a subset of ML
Ref: https://pub.towardsai.net/machine-learning-algorithms-for-beginners-with-python-code-examples-ml-19c6afd60daa
https://prutor.ai/ml-what-is-machine-learning/
Basic Difference in ML and Traditional Programming?
https://prutor.ai/ml-what-is-machine-learning/
▪ What exactly does learning mean for a computer?
▪ A computer learns from experience with respect to some class of tasks if its performance in a
given task improves with the experience.
▪ Formally: a program learns from experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P, improves with experience E.
▪ Traditional programming: we feed in
DATA (input) + PROGRAM (logic), run it
on the machine, and get the output.
▪ Machine learning: we feed in
DATA (input) + OUTPUT, run it on the machine
during training, and the machine creates
its own program (logic), which can be
evaluated while testing.
Machine Learning in the Current World
https://www.javatpoint.com/applications-of-machine-learning
Traditional ML vs DL
https://www.researchgate.net/figure/Comparison-between-ML-and-Dl-algorithm_fig5_344628869
Neural Nets vs Deep Learning
[Taken from public domain. Original authors highly acknowledged.]
▪ The concept of deep learning originated from neural networks.
▪ A good example of a deep neural network is the feed-forward neural network (FFNN).
▪ Backpropagation (BP) is the workhorse algorithm for learning the parameters of an
FFNN.
▪ BP did not work well for networks having more than a small number of hidden
layers.
▪ Insufficient data, leading to overfitting, and the difficulty of training deep networks
were the main limitations.
NN vs DL
https://yangxiaozhou.github.io/data/2020/09/24/intro-to-cnn.html
AI, ML, NN and DL
https://www.researchgate.net/figure/Relationship-between-artificial-intelligence-machine-learning-deep-learning-and_fig2_351110482
Information Extraction & Modeling
[Taken from public domain. Original authors highly acknowledged.]
▪ Information: knowledge about something, e.g., a face, a speaker, a route.
▪ Extraction: extract the physical quantities that carry the information
(feature extraction or representation learning).
▪ Modeling: the invariant entity that carries the knowledge; from the features,
model these invariant entities.
▪ Process: in human-computer interaction, this refers to signal processing,
pattern recognition, machine learning, and deep learning.
When Machine Learning and when Deep Learning
[Taken from public domain. Original authors highly acknowledged.]
▪ Is the problem statement well or ill defined?
▪ Is the amount of data small or very large?
▪ Is domain knowledge high or low?
▪ Is meaningful feature extraction possible or not?
▪ The first answer in each case favours machine learning; the second favours deep learning.
Regression
[Taken from public sources. Original authors acknowledged.]
▪ Objective of regression task.
▪ Univariate vs multivariate regression.
▪ Linear vs nonlinear regression.
▪ Cost function.
▪ Gradient descent method of optimization.
▪ Normal equation approach for parameter estimation.
▪ Logistic regression
Clustering
[Taken from public sources. Original authors acknowledged.]
▪ Objective of clustering task.
▪ Partitioning approach - k-means, fuzzy-c means.
▪ Model based approach - Gaussian mixture model (GMM).
▪ Expectation-maximization (EM) algorithm.
▪ Hierarchical clustering.
▪ Hierarchical - agglomerative clustering.
▪ Hierarchical - divisive clustering.
Classification
[Taken from public sources. Original authors acknowledged.]
▪ Objective of classification task.
▪ Binary vs multiclass classification.
▪ Generative vs discriminative classification.
▪ Parametric vs nonparametric classification.
▪ Logistic regression.
▪ k-nearest neighbour classification.
▪ Support vector machine.
▪ Generative classifiers.
Dimensionality Reduction
[Taken from public sources. Original authors acknowledged.]
▪ Objective of dimensionality reduction task.
▪ Principal component analysis (PCA).
▪ Linear discriminant analysis (LDA).
▪ PCA based dimensionality reduction.
▪ PCA based classification.
▪ LDA based dimensionality reduction.
▪ LDA based classification.
Time Series Modelling
[Taken from public sources. Original authors acknowledged.]
▪ Objective of time series modelling task.
▪ Markov process and models.
▪ Observable vs hidden Markov model
▪ Hidden Markov Model (HMM).
▪ Training and testing of HMM
▪ Forward and backward variables.
▪ Viterbi algorithm for optimal state sequence.
▪ Expectation maximization (EM) approach for training.
Bayesian Approach
[Taken from public sources. Original authors acknowledged.]
▪ Objective of Bayesian approach.
▪ Probabilistic framework for classification.
▪ Bayesian classification.
▪ Bayesian learning.
▪ Maximum a posteriori (MAP) approach.
▪ Bayes optimal classifier.
▪ Gibbs sampling.
▪ Naive Bayes classifier.
▪ Bayesian network.
Types of Machine Learning (ML)
Classification of Machine Learning
▪ At a broad level, machine learning can be classified into the following types:
https://www.javatpoint.com/machine-learning
▪ Supervised ML models
▪ Unsupervised ML models
▪ Semi-supervised ML models
(a combination of supervised and
unsupervised models)
▪ Reinforcement learning models
Supervised Machine Learning (ML)
Supervised Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Supervised learning is a type of machine learning in which machines are trained
using well-"labelled" training data, and on the basis of that data, machines predict the
output.
▪ Labelled data means some input data is already tagged with the correct
output.
▪ In supervised learning, the training data provided to the machines works as the
supervisor that teaches the machines to predict the output correctly.
▪ It applies the same concept as a student learning under the supervision of a teacher.
▪ Supervised learning is a process of providing input data as well as correct output
data to the machine learning model. The aim of a supervised learning algorithm is to
find a mapping function to map the input variable (x) to the output variable (y).
How Does Supervised Learning Work?
▪ In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on held-out test
data, and then it predicts the output.
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Suppose we have a dataset of different types
of shapes, which includes squares, rectangles,
triangles, and polygons. The first step is
that we need to train the model on each
shape.
▪ If the given shape has four sides, and all the
sides are equal, then it will be labelled as a
square.
▪ If the given shape has three sides, then it will
be labelled as a triangle.
▪ If the given shape has six equal sides, then it will be labelled as a hexagon.
▪ Now, after training, we test our model using the test set, and the task of the model is to identify the shape.
How Does Supervised Learning Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.
▪ Algorithms like Decision Tree, Random Forest, KNN, Logistic Regression, etc. fall under
supervised ML models.
Steps Involved in Supervised Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into a training set,
a test set, and a validation set.
• Determine the input features of the training
dataset, which should carry enough information
that the model can accurately predict the output.
• Determine a suitable algorithm for the model, such as support vector machine, decision tree, etc.
• Execute the algorithm on the training dataset.
• Evaluate the accuracy of the model by providing the test set. If the model predicts the correct outputs,
the model is accurate. A minimal sketch of these steps is shown below.
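A minimal sketch of these steps with scikit-learn (the iris dataset, the decision-tree choice, and the 80/20 split are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 1-2: gather labelled data
X, y = load_iris(return_X_y=True)

# Step 3: split into training and test sets (a validation set could be split off the same way)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 4-6: choose a suitable algorithm and execute it on the training data
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 7: evaluate accuracy on the held-out test set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```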
Types of supervised Machine Learning Algorithms
Ref: https://www.javatpoint.com/supervised-machine-learning
Regression
• Regression algorithms are used when there is a relationship between the input
variable and the output variable. They are used for the prediction of continuous
variables, such as weather forecasting, market trends, etc. Below are some
popular regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Classification
• Classification algorithms are used when the output variable is
categorical, i.e., there are distinct classes such as Yes/No, Male/Female,
True/False, etc. An example application is spam filtering. Popular algorithms:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Advantages/Disadvantages of Supervised learning
Ref: https://www.javatpoint.com/supervised-machine-learning
Advantages of supervised learning:
• With the help of supervised learning, the model can predict the output on the basis of prior
experience.
• In supervised learning, we can have an exact idea about the classes of objects.
• Supervised learning models help us solve various real-world problems, such as fraud detection,
spam filtering, etc.
Disadvantages of supervised learning:
• Supervised learning models are not suitable for handling very complex tasks.
• Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.
Unsupervised Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ In supervised machine learning, models are trained using labeled data under supervision.
But there may be many cases in which we do not have labeled data
and need to find the hidden patterns in a given dataset. To solve such cases in
machine learning, we need unsupervised learning techniques.
▪ What is Unsupervised Learning?
▪ Unsupervised learning is a machine learning technique in which models are not supervised using a
training dataset. Instead, the models themselves find the hidden patterns and insights in the given data.
It can be compared to the learning which takes place in the human brain while learning new things. It
can be defined as:
▪ Unsupervised learning is a type of machine learning in which models are trained using an
unlabeled dataset and are allowed to act on that data without any supervision.
▪ Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of a dataset, group the data
according to similarities, and represent the dataset in a compressed format.
Unsupervised Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Suppose the unsupervised learning algorithm is
given an input dataset containing images of
different types of cats and dogs.
▪ The algorithm is never trained on the given
dataset, which means it does not have any idea
about the features of the dataset.
▪ The task of the unsupervised learning algorithm is to
identify the image features on its own.
▪ The unsupervised learning algorithm will perform this
task by clustering the image dataset into
groups according to the similarities between images.
Why use Unsupervised Learning?
Ref: https://www.javatpoint.com/supervised-machine-learning
• Unsupervised learning is helpful for finding useful insights in the
data.
• Unsupervised learning is much like how a human learns to think
from their own experience, which makes it closer to real AI.
• Unsupervised learning works on unlabelled and uncategorized data,
which makes unsupervised learning more important.
• In the real world, we do not always have input data with the
corresponding output, so to solve such cases we need unsupervised
learning.
Working of Unsupervised Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Here, we have taken unlabeled input data, which means it is not categorized and
corresponding outputs are not given.
▪ This unlabeled input data is fed to the machine learning model in order to train it. First, the model
interprets the raw data to find the hidden patterns in it, and then a suitable
algorithm is applied, such as k-means clustering or hierarchical clustering.
▪ Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and differences between the objects. A minimal sketch follows.
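A minimal sketch of this workflow with k-means (the synthetic two-blob data and k = 2 are assumptions for the example):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two fuzzy groups of 2-D points, with no outputs given
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# The algorithm groups the objects purely by similarity
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[-5:])  # points from the two blobs land in different clusters
```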
Types of Unsupervised Learning Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
• Clustering:
• Clustering is a method of grouping objects into clusters
such that objects with the most similarities remain in a group
and have few or no similarities with the objects of another group.
• Cluster analysis finds the commonalities between the data
objects and categorizes them as per the presence and absence of
those commonalities.
• Association:
• An association rule is an unsupervised learning method which is
used for finding relationships between variables in a large
database.
• It determines the sets of items that occur together in the dataset.
• Association rules make marketing strategies more effective:
for example, people who buy item X (say, bread) also tend
to purchase item Y (butter or jam).
• A typical example of association rules is Market Basket
Analysis, sketched below.
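A tiny market-basket sketch of the association idea (the transactions and the plain co-occurrence count are assumptions; real association-rule mining, e.g., the Apriori algorithm, would also compute support and confidence):

```python
from itertools import combinations
from collections import Counter

# Each transaction is the set of items one customer bought together
transactions = [
    {"bread", "butter"}, {"bread", "jam"}, {"bread", "butter", "jam"},
    {"milk", "bread"}, {"milk", "eggs"},
]

# Count how often each pair of items occurs together
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(2))  # ('bread','butter') and ('bread','jam') co-occur most
```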
Unsupervised Learning algorithms
Ref: https://www.javatpoint.com/supervised-machine-learning
• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Neural networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
Advantages/ Disadvantages of Unsupervised Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
• Advantages
• Unsupervised learning is used for more complex tasks than supervised
learning because, in unsupervised learning, we don't have labelled input data.
• Unsupervised learning is preferable because it is easier to get unlabelled data
than labelled data.
• Disadvantages
• Unsupervised learning is intrinsically more difficult than supervised learning, as it
does not have corresponding outputs.
• The result of an unsupervised learning algorithm might be less accurate, as the input
data is not labelled and the algorithm does not know the exact output in advance.
Difference between Supervised and Unsupervised Learning
▪ Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.
▪ A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.
▪ A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in the data.
▪ In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
▪ The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights in an unknown dataset.
▪ Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.
▪ Supervised learning can be categorized into classification and regression problems; unsupervised learning can be classified into clustering and association problems.
▪ Supervised learning can be used in cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used in cases where we have only input data and no corresponding output data.
▪ A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result by comparison.
▪ Supervised learning is not close to true artificial intelligence, as we first train the model on each example and only then can it predict the correct output; unsupervised learning is closer to true artificial intelligence, as it learns similarly to how a child learns daily routine things from experience.
▪ Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, and Multi-class Classification; unsupervised learning includes algorithms such as clustering, KNN, and the Apriori algorithm.
Ref: https://www.javatpoint.com/supervised-machine-learning
Regression
Regression Analysis in Machine learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Regression analysis is a statistical method to model
the relationship between a dependent (target)
variable and one or more independent (predictor)
variables. More specifically, regression
analysis helps us to understand how the value of the
dependent variable changes with respect to an
independent variable when the other independent
variables are held fixed. It predicts continuous/real
values such as temperature, age, salary, price, etc.
▪ Example: Suppose there is a marketing company A
which runs various advertisements every year and gets
sales from them. The list shows the advertisements run
by the company in the last 5 years and the
corresponding sales:
Regression Analysis in Machine learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Now, the company wants to spend $200 on
advertisement this year and wants to
know the prediction for its sales this
year. To solve such prediction
problems in machine learning, we need
regression analysis.
Regression Analysis in Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Regression is a supervised learning technique which helps in
finding the correlation between variables and enables us
to predict a continuous output variable based on one or
more predictor variables. It is mainly used for prediction,
forecasting, time-series modelling, and determining the
cause-effect relationship between variables.
▪ In regression, we plot a graph between the variables
which best fits the given datapoints; using this plot, the
machine learning model can make predictions about the
data. In simple words, "Regression shows a line or curve
that passes through all the datapoints on the target-
predictor graph in such a way that the vertical distance
between the datapoints and the regression line is
minimum." The distance between the datapoints and the line tells
whether the model has captured a strong relationship or not.
Regression Analysis in Machine learning
Ref: https://www.javatpoint.com/supervised-machine-learning
Some examples of regression:
• Prediction of rain using temperature and other factors
• Determining market trends
• Prediction of road accidents due to rash driving
Why do we use Regression Analysis?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Regression analysis helps in the prediction of a continuous variable. There are
various scenarios in the real world where we need future predictions, such
as weather conditions, sales, marketing trends, etc.; for such cases we
need a technique which can make predictions accurately. Regression
analysis is such a statistical method, used in
machine learning and data science.
▪ Regression estimates the relationship between the target and the independent
variables.
▪ It is used to find the trends in data.
▪ By performing regression, we can determine the most important
factor, the least important factor, and how each factor affects the other
factors.
Types of Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
Linear Regression
Linear Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Linear regression is a statistical regression
method which is used for predictive analysis.
▪ It shows the relationship between continuous
variables.
▪ It is used for solving regression problems in
machine learning.
▪ Linear regression shows the linear relationship
between the independent variable (X-axis) and
the dependent variable (Y-axis), hence the name
linear regression.
Y = aX + b
Here, Y is the dependent (target) variable,
X is the independent (predictor) variable,
and a and b are the linear coefficients.
Linear Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ If there is only one input variable (x),
such linear regression is called
simple linear regression. If
there is more than one input
variable, it is called multiple
linear regression.
▪ The relationship between variables in
the linear regression model can be
explained using the image. Here we
are predicting the salary of an
employee on the basis of years of
experience.
Linear Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Linear regression is one of the easiest and most popular
machine learning algorithms. It is a statistical method that
is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as
sales, salary, age, product price, etc.
▪ The linear regression algorithm shows a linear relationship
between a dependent variable (y) and one or more independent
variables (x), hence the name linear regression. Since linear
regression shows a linear relationship, it
finds how the value of the dependent variable changes
according to the value of the independent variable.
▪ The linear regression model provides a sloped straight line
representing the relationship between the variables.
Linear Regression in Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
y = a0 + a1x + ε
Here,
▪ y = dependent variable (target variable)
▪ x = independent variable (predictor variable)
▪ a0 = intercept of the line (gives an additional degree
of freedom)
▪ a1 = linear regression coefficient (scale factor applied to
each input value)
▪ ε = random error
▪ The values of the x and y variables are the training
dataset for the linear regression model
representation. A minimal fitting sketch follows.
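A minimal sketch of estimating a0 and a1 from training pairs by ordinary least squares (the toy experience/salary numbers are assumptions for the example):

```python
import numpy as np

# Training data: x = years of experience, y = salary (arbitrary units)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

# Closed-form least-squares estimates for y = a0 + a1*x
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print(f"y ≈ {a0:.2f} + {a1:.2f}·x")  # the best-fit line for this data
```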
Types of Linear Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.
▪ Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.
Finding the best fit line
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ When working with linear regression, our
main goal is to find the best-fit line, which
means the error between the predicted values
and actual values should be minimized. The
best-fit line has the least error.
▪ Different values for the weights or the
coefficients of the line (a0, a1) give different
regression lines, so we need to calculate
the best values for a0 and a1 to find the best-fit
line; to calculate this, we use a cost
function.
y = a0 + a1x + ε
Cost function
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Different values for the weights or coefficients of
the line (a0, a1) give different regression lines,
and the cost function is used to estimate the values
of the coefficients for the best-fit line.
▪ The cost function optimizes the regression coefficients
or weights. It measures how well a linear regression
model is performing.
▪ We can use the cost function to find the accuracy of
the mapping function, which maps the input
variable to the output variable. This mapping
function is also known as the hypothesis function.
y = a0 + a1x + ε
Cost function
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ For linear regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values and
the actual values.
▪ For the above linear equation, MSE can be calculated as:
MSE = (1/N) Σ (Yi − (a1xi + a0))²
where
N = total number of observations,
Yi = actual value, and
(a1xi + a0) = predicted value.
▪ Residuals: The distance between an actual value and the predicted value is called the residual. If the
observed points are far from the regression line, the residuals are high, and so the cost function is
high. If the scatter points are close to the regression line, the residuals are small, and hence so is
the cost function.
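A quick numerical check of this cost function (reusing the toy data from the sketch above; the two candidate lines are assumptions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

def mse(a0, a1):
    # Mean squared error between actual y and predictions a0 + a1*x
    return np.mean((y - (a0 + a1 * x)) ** 2)

print(mse(25.0, 5.0))  # cost of one candidate line
print(mse(25.3, 4.9))  # a nearby line with a smaller MSE fits the data better
```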
Logistic Regression
Logistic Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Logistic regression is another supervised learning
algorithm, used to solve classification
problems. In classification problems, we have a
dependent variable in a binary or discrete format,
such as 0 or 1.
▪ The logistic regression algorithm works with
categorical variables such as 0 or 1, Yes or No, True
or False, Spam or Not Spam, etc.
▪ It is a predictive analysis algorithm which works on
the concept of probability.
▪ Logistic regression uses the sigmoid or logistic
function to model the data, together with a
correspondingly more complex cost function.
Logistic Regression
https://mathworld.wolfram.com/SigmoidFunction.html
f(x) = 1 / (1 + e^(−x))
▪ f(x) = output, between 0 and 1
▪ x = input to the function
▪ e = base of the natural logarithm
▪ There are three types of logistic regression:
• Binary (0/1, pass/fail)
• Multinomial (cats, dogs, lions)
• Ordinal (low, medium, high)
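A minimal sketch of the sigmoid and a logistic-regression-style prediction (the toy weight, bias, and 0.5 decision threshold are assumptions for the example):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# A logistic model: probability of class 1 given feature x
w, b = 2.0, -1.0                # toy weight and bias
x = np.array([-2.0, 0.0, 0.5, 3.0])
p = sigmoid(w * x + b)
print(p)                        # probabilities in (0, 1)
print((p >= 0.5).astype(int))   # threshold at 0.5 -> class labels 0/1
```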
Polynomial Regression
Polynomial Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
• Polynomial regression is a type of regression which
models a non-linear dataset using a linear model.
• It is similar to multiple linear regression, but it fits a
non-linear curve between the values of x and the
corresponding conditional values of y.
• Suppose there is a dataset whose datapoints lie in a non-
linear fashion; in such a case, linear regression will not best fit those
datapoints. To cover such datapoints, we need polynomial regression.
• In polynomial regression, the original features are transformed into polynomial
features of a given degree and then modelled using a linear model, which means
the datapoints are best fitted using a polynomial curve.
Polynomial Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
• The equation for polynomial regression is also
derived from the linear regression equation: the
linear regression equation Y = b0 + b1x is
transformed into the polynomial regression equation
Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
• Here Y is the predicted/target output, and b0, b1, ..., bn
are the regression coefficients; x is the
independent/input variable.
• The model is still linear, because it is linear in the coefficients,
even though the features (x², x³, ...) are non-linear in x.
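A minimal polynomial-regression sketch: transform x into polynomial features, then fit a linear model (the degree-2 choice and the toy curve are assumptions):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy non-linear data: y ≈ 1 + 2x + 3x² plus noise
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40).reshape(-1, 1)
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + 0.1 * rng.standard_normal(40)

# Polynomial features of a given degree + a linear model = polynomial regression
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)
print(model.predict([[0.5]]))  # close to 1 + 2*0.5 + 3*0.25 = 2.75
```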
Support Vector Regression
Support Vector Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Support Vector Machine is a supervised learning
algorithm which can be used for regression as well as
classification problems. If we use it for regression
problems, it is termed Support Vector Regression (SVR).
▪ Support Vector Regression is a regression algorithm
which works for continuous variables. Below are some
keywords used in Support Vector Regression:
▪ Kernel: a function used to map lower-
dimensional data into higher-dimensional data.
▪ Hyperplane: in general SVM, it is a separation line
between two classes, but in SVR it is the line which helps
to predict the continuous variable and cover most of the
datapoints.
Support Vector Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Boundary lines: the two lines
on either side of the hyperplane, which create a margin for
the data points.
▪ Support vectors: the datapoints
which are nearest to the hyperplane and of the opposite
class. In SVR, we always try to determine a hyperplane
with a maximum margin, so that the maximum number of
datapoints is covered by that margin.
▪ The main goal of SVR is to include the maximum number of
data points within the boundary lines, and the
hyperplane (best-fit line) must contain a maximum
number of data points. A minimal SVR sketch follows.
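A minimal SVR sketch with scikit-learn (the RBF kernel and the C and epsilon values are assumptions for the example):

```python
import numpy as np
from sklearn.svm import SVR

# Toy continuous target: y = sin(x) plus noise
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 60)).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

# epsilon sets the tube (margin) around the fit; points outside it become support vectors
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(svr.predict([[1.5]]))  # near sin(1.5) ≈ 1.0
print(len(svr.support_))     # number of support vectors used
```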
Decision Tree Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
• Decision Tree is a supervised learning algorithm which can
be used for solving both classification and regression
problems.
• It can solve problems for both categorical and numerical data.
• Decision Tree regression builds a tree-like structure in which
each internal node represents a "test" on an attribute,
each branch represents the result of the test, and each leaf
node represents the final decision or result.
• A decision tree is constructed starting from the root
node/parent node (the dataset), which splits into left and right child
nodes (subsets of the dataset). These child nodes are further
divided into their own children, themselves becoming the
parent nodes of those nodes.
Decision Tree Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ The image shows an example of Decision
Tree regression; here, the model is trying
to predict a person's choice between
a sports car and a luxury car. A minimal
regression sketch follows.
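A minimal decision-tree regression sketch (the step-shaped toy data and the max_depth value are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: the target is a step-like function of one feature
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=float)
y = np.array([5.0, 5.2, 5.1, 9.8, 10.1, 10.0, 14.9, 15.2])

# Internal nodes split the feature range; each leaf stores the predicted value
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[2.5], [4.5], [7.5]]))  # roughly 5, 10, 15
```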
Random Forest
Random forest
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Random forest is one of the most
powerful supervised learning algorithms,
capable of performing
regression as well as classification
tasks.
▪ Random Forest regression is an
ensemble learning method which
combines multiple decision trees and
predicts the final output based on the
average of each tree's output. The
combined decision trees are called the
base models:
g(x) = f0(x) + f1(x) + f2(x) + ...
Random forest
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Random forest uses the Bagging or
Bootstrap Aggregation technique
of ensemble learning, in which the
aggregated decision trees run in
parallel and do not interact with
each other.
▪ With the help of Random Forest
regression, we can prevent
overfitting in the model by
creating random subsets of the
dataset. A minimal sketch follows.
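A minimal random-forest regression sketch (the tree count and the toy data are assumptions for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: noisy quadratic target
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (200, 1))
y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(200)

# Bagging: each of the 100 trees sees a bootstrap sample; the prediction is their average
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[1.0]]))  # close to 1.0² = 1.0
```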
Ridge Regression
Ridge Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
• Ridge regression is one of the most robust versions of linear regression,
in which a small amount of bias is introduced so that we can get
better long-term predictions.
• The amount of bias added to the model is known as the Ridge Regression
penalty. We can compute this penalty term by multiplying
lambda by the squared weight of each individual feature.
• The equation for ridge regression is:
Cost = Σ (Yi − Ŷi)² + λ·Σ bj²
Ridge Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ A general linear or polynomial regression will fail if there is high
collinearity between the independent variables; to solve such
problems, ridge regression can be used.
▪ Ridge regression is a regularization technique, which is used to
reduce the complexity of the model. It is also called L2
regularization.
▪ It helps to solve problems where we have more parameters than
samples. A minimal sketch follows.
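A minimal ridge sketch, where alpha plays the role of λ (the collinear toy data and the alpha value are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two highly collinear features: plain linear regression is unstable here
rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 100)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(0, 1, 100)])
y = 3 * x1 + 0.1 * rng.normal(0, 1, 100)

print(LinearRegression().fit(X, y).coef_)  # large, unstable coefficients
print(Ridge(alpha=1.0).fit(X, y).coef_)    # the penalty shrinks them to sane values
```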
Lasso Regression
Lasso Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Lasso regression is another regularization technique to reduce the
complexity of the model.
▪ It is similar to ridge regression, except that the penalty term contains only the
absolute weights instead of the squares of the weights.
▪ Since it takes absolute values, it can shrink a slope all the way to 0, whereas ridge
regression can only shrink it close to 0.
▪ It is also called L1 regularization. The equation for lasso regression is:
Cost = Σ (Yi − Ŷi)² + λ·Σ |bj|
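A minimal lasso sketch showing the shrink-to-zero behaviour (the alpha value and the toy data are assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

# The target depends only on the first of five features
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 5))
y = 4 * X[:, 0] + 0.1 * rng.normal(0, 1, 200)

# The L1 penalty drives the four irrelevant coefficients exactly to 0
print(Lasso(alpha=0.1).fit(X, y).coef_)  # roughly [3.9, 0, 0, 0, 0]
```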
Model Performance
https://byjus.com/maths/coefficient-of-determination/
https://aaweg-i.medium.com/what-precautions-we-need-to-keep-in-mind-when-using-coefficient-of-determination-98625e8bdb51
▪ The goodness of fit determines how well the regression line fits the set of observations. The
process of finding the best model out of various models is called optimization. It can be
assessed by the method below:
R-squared method:
▪ It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.
▪ A high value of R-squared indicates less difference between the predicted values and the
actual values, and hence represents a good model.
▪ It is also called the coefficient of determination, or the coefficient of multiple determination for
multiple regression. A minimal computation follows.
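A minimal R-squared computation using the standard formula R² = 1 − SS_res / SS_tot (the toy values are assumptions):

```python
import numpy as np

y_true = np.array([30.0, 35.0, 41.0, 44.0, 50.0])
y_pred = np.array([30.2, 35.1, 40.0, 45.0, 49.7])

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(f"R² = {1 - ss_res / ss_tot:.3f}")  # close to 1 -> good fit
```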
Gradient Descent
https://www.javatpoint.com/gradient-descent-in-machine-learning
▪ Gradient descent is used to minimize the MSE
by calculating the gradient of the cost function.
▪ A regression model uses gradient descent to
update the coefficients of the line by reducing
the cost function.
▪ It is done by randomly selecting initial values for the
coefficients and then iteratively updating them
to reach the minimum of the cost function.
Gradient Descent
https://www.javatpoint.com/gradient-descent-in-machine-learning
▪ Gradient descent is one of the most commonly
used optimization algorithms for training machine learning
models, by minimizing the error between actual
and expected results. Gradient descent is also
used to train neural networks.
▪ An optimization algorithm performs the task of
minimizing/maximizing an objective function f(x)
parameterized by x.
▪ Similarly, in machine learning, optimization is the task of
minimizing the cost function parameterized by the model's
parameters. The main objective of gradient descent is to
minimize a convex function by iterating on the parameter
updates.
What is Gradient Descent or Steepest Descent?
https://www.javatpoint.com/gradient-descent-in-machine-learning
▪ Gradient descent, or steepest descent, is one of the most
commonly used iterative optimization algorithms in
machine learning for training machine learning and
deep learning models. It helps in finding the local
minimum of a function.
▪ If we move towards the negative gradient, i.e., away from
the gradient of the function at the current point, we will
reach the local minimum of the function.
▪ Whenever we move towards the positive gradient, i.e., towards
the gradient of the function at the current point,
we will reach the local maximum of the function.
▪ The main objective of using a gradient descent
algorithm is to minimize the cost function by
iteration.
Gradient Descent
https://www.javatpoint.com/gradient-descent-in-machine-learning
https://www.geeksforgeeks.org/gradient-descent-in-linear-regression/
▪ Calculate the first-order derivative of the function to
compute the gradient, or slope, of the function.
▪ Move in the direction opposite to the gradient, stepping from
the current point by alpha times the gradient, where
alpha is the learning rate: a tuning parameter
in the optimization process which helps decide the length
of the steps. A minimal sketch of this loop follows.
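A minimal gradient-descent sketch for the linear-regression MSE above (the learning rate and iteration count are assumptions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])

a0, a1, alpha = 0.0, 0.0, 0.02  # start from an arbitrary point
for _ in range(5000):
    err = (a0 + a1 * x) - y           # prediction error
    grad_a0 = 2 * err.mean()          # dMSE/da0
    grad_a1 = 2 * (err * x).mean()    # dMSE/da1
    a0 -= alpha * grad_a0             # step opposite the gradient
    a1 -= alpha * grad_a1
print(f"y ≈ {a0:.2f} + {a1:.2f}·x")   # converges to the least-squares line
```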
What is Cost-function?
https://www.javatpoint.com/gradient-descent-in-machine-learning
▪ The cost function is defined as the measurement
of the difference, or error, between actual values and
expected values.
▪ It helps to improve machine learning
efficiency by providing feedback to the model so that
it can minimize the error and find the local or global
minimum. The algorithm continuously iterates along the
direction of the negative gradient until the cost
function approaches zero.
▪ At this steepest-descent point, the model stops
learning further.
How does Gradient Descent work?
https://www.javatpoint.com/gradient-descent-in-machine-learning
▪ The starting point (shown in the figure) is used to evaluate the
performance, as it is just an arbitrary point.
At this starting point, we derive the first derivative, or
slope, and then use a tangent line to calculate the
steepness of the slope. This slope informs the
updates to the parameters (weights and bias).
▪ The slope is steeper at the starting (arbitrary)
point, but whenever new parameters are
generated, the steepness gradually reduces; at the
lowest point, the algorithm approaches the point
of convergence.
▪ The main objective of gradient descent is to minimize
the cost function, i.e., the error between expected and
actual values.
Gradient Descent
https://www.javatpoint.com/gradient-descent-in-machine-learning
Direction & Learning Rate
▪ These two factors determine the partial-derivative calculations of future iterations and
allow the algorithm to reach the point of convergence, a local minimum, or the global minimum.
Learning Rate:
▪ It is defined as the step size taken to reach the minimum or lowest point. It is typically a small
value, and it is evaluated and updated based on the behavior of the cost function. If the learning rate
is high, it results in larger steps, but it also risks overshooting the minimum. At the same
time, a low learning rate gives small step sizes, which compromises overall efficiency but gives
the advantage of more precision.
Types of Gradient Descent
https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/
▪ Based on how much of the training data is used to compute the error for each update, the
gradient descent learning algorithm can be divided into:
▪ Batch gradient descent
▪ Mini-batch gradient descent
▪ Stochastic gradient descent
Batch Gradient Descent
https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/
▪ Batch Gradient Descent:
▪ Batch gradient descent (BGD) computes the error for each point in the
training set and updates the model only after evaluating all training examples in the
batch. One such pass is known as a training epoch.
▪ Advantages of batch gradient descent:
▪ It produces less noise than other types of gradient
descent.
▪ It produces stable gradient descent convergence.
▪ It is computationally efficient, as all resources are used
to process all training samples together.
Stochastic gradient descent
▪ Stochastic gradient descent (SGD) is a type of gradient descent that processes one
training example per iteration. In other words, it works through a training epoch
example by example and updates the parameters
one example at a time.
▪ As it requires only one training example at a time, it is easier to store in
allocated memory.
▪ However, it shows some computational-efficiency losses in
comparison to batch gradient descent, as its
frequent updates require more detail and speed.
▪ Further, due to the frequent updates, the gradient is
noisy. However, sometimes this can be helpful in
finding the global minimum and in escaping local
minima.
https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/
Stochastic gradient descent
▪ Advantages of stochastic gradient descent:
▪ In stochastic gradient descent (SGD), learning happens on every example, which
gives it a few advantages over other types of gradient descent.
▪ It is easier to fit in the available memory.
▪ It is relatively fast to compute compared with batch gradient descent.
https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/
Mini-Batch Gradient Descent
▪ Mini-batch gradient descent is a combination of batch gradient descent and
stochastic gradient descent. It divides the training dataset into small batches and
then performs an update for each of those batches.
▪ Splitting the training dataset into smaller batches strikes a balance between the
computational efficiency of batch gradient descent and the speed of stochastic gradient
descent.
▪ Hence, we achieve a special type of gradient descent with higher computational
efficiency and a less noisy gradient. A comparison sketch of the three variants follows.
▪ Advantages of mini-batch gradient descent:
▪ It is easier to fit in allocated memory.
▪ It is computationally efficient.
▪ It produces stable gradient descent convergence.
https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/
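A minimal sketch contrasting the three update schemes on the same data (the batch sizes, learning rate, and epoch count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, 200)
y = 4.0 * X + 2.0 + 0.2 * rng.standard_normal(200)

def gradient_descent(batch_size, alpha=0.01, epochs=200):
    a0 = a1 = 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(X))               # shuffle each epoch
        for start in range(0, len(X), batch_size):  # one update per batch
            b = idx[start:start + batch_size]
            err = (a0 + a1 * X[b]) - y[b]
            a0 -= alpha * 2 * err.mean()
            a1 -= alpha * 2 * (err * X[b]).mean()
    return round(a0, 2), round(a1, 2)

print(gradient_descent(batch_size=200))  # batch GD: one smooth update per epoch
print(gradient_descent(batch_size=1))    # stochastic GD: noisy, frequent updates
print(gradient_descent(batch_size=32))   # mini-batch GD: the usual compromise
# All three head toward a0 ≈ 2, a1 ≈ 4, at different speeds and noise levels.
```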
Challenges with the Gradient Descent
https://www.javatpoint.com/gradient-descent-in-machine-learning
Local Minima and Saddle Points:
▪ For convex problems, gradient descent can find the global minimum
easily, while for non-convex problems it is sometimes difficult to find
the global minimum, where the machine learning model achieves the
best results.
▪ Whenever the slope of the cost function is at or very close to
zero, the model stops learning. Apart from the global
minimum, this zero slope also occurs at saddle points and local
minima. Local minima generate a shape similar to the global minimum,
where the slope of the cost function increases on both sides of the
current point.
▪ In contrast, at a saddle point the negative gradient only occurs on one side of the point, which is a
local maximum on one side and a local minimum on the other side. The name saddle point comes from
the shape of a horse's saddle.
▪ The name local minimum is used because the value of the loss function is minimum at that point in a local region. In
contrast, the name global minimum is used because the value of the loss function is minimum there,
globally, across the entire domain of the loss function.
Vanishing and Exploding Gradient
https://www.javatpoint.com/gradient-descent-in-machine-learning
▪ Vanishing gradients:
▪ A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the
gradient becomes smaller and smaller, causing the earlier layers of the network to learn more slowly than
the later layers. Once this happens, the weight parameters barely update until they become insignificant.
▪ Exploding gradients:
▪ An exploding gradient is just the opposite of a vanishing gradient: it occurs when the gradient is too large.
In this scenario, the model weights grow and may end up represented as NaN. This problem
can be addressed using dimensionality-reduction techniques, which help to minimize complexity within
the model.
Classification Algorithm in Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
What is a Classification Algorithm?
▪ A classification algorithm is a supervised learning technique that is used
to identify the category of new observations on the basis of training data. In
classification, a program learns from the given dataset or observations and
then classifies new observations into one of a number of classes or groups, such as
Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be
called targets/labels or categories.
▪ y = f(x), where y is a categorical output.
▪ The best example of an ML classification algorithm is an email spam detector.
▪ The main goal of a classification algorithm is to identify the categories in a
given dataset, and such algorithms are mainly used to predict the output for
categorical data.
▪ Classification algorithms can be better understood using the diagram. In the
diagram, there are two classes, Class A and Class B. The classes have
features that are similar within each class and dissimilar between classes.
Classification Algorithm in Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ The algorithm which implements classification on a
dataset is known as a classifier.
▪ Binary classifier: if the classification problem has only
two possible outcomes, it is called a binary
classifier.
▪ Examples: YES or NO, MALE or FEMALE, SPAM or
NOT SPAM, CAT or DOG, etc.
▪ Multi-class classifier: if a classification problem has
more than two outcomes, it is called a multi-class
classifier.
▪ Examples: classification of types of crops,
classification of types of music.
Classification
▪ A supervised learning technique that is used to identify the category of new observations on the basis of training
data [1].
▪ In classification, the output is categorical, unlike in regression, where it was based on predicting "values".
▪ Types of classification [2]:
▪ Binary classification: when we have to categorize given data into 2 distinct classes. Example: on the basis
of the given health condition of a person, we have to determine whether the person has a certain disease or not.
▪ Multiclass classification: the number of classes is more than 2. Example: on the basis of data about
different species of flowers, we have to determine which species our observation belongs to.
Ref: [1] https://www.javatpoint.com/classification-algorithm-in-machine-learning
[2] https://www.geeksforgeeks.org/getting-started-with-classification/?ref=lbp
Classification and its types
▪ General Block diagram of classification task:
Ref: https://www.geeksforgeeks.org/getting-started-with-classification/?ref=lbp
▪ There are various types of classifiers. Some of them are:
▪ Linear Classifiers: Logistic Regression
▪ Tree-Based Classifiers: Decision Tree Classifier
▪ Support Vector Machines
▪ Artificial Neural Networks
▪ Bayesian Regression
▪ Gaussian Naive Bayes Classifiers
▪ Stochastic Gradient Descent (SGD) Classifier
▪ Ensemble Methods: Random Forests, AdaBoost, Bagging Classifier, Voting Classifier, etc.
• X: Pre-classified data
• y: label/observations for X
• y’: predicted labels for X
109
Learners in Classification Problems
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ In classification problems, there are two types of learners:
▪ Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. Classification is then done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions. Examples: the K-NN algorithm, case-based reasoning.
▪ Eager Learners: Eager learners develop a classification model from the training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction.
▪ Examples: Decision Trees, Naïve Bayes, ANN.
110
Types of ML Classification Algorithms
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Classification algorithms can be further divided into mainly two categories:
▪ Linear Models
▪ Logistic Regression
▪ Support Vector Machines
▪ Non-linear Models
▪ K-Nearest Neighbours
▪ Kernel SVM
▪ Naïve Bayes
▪ Decision Tree Classification
▪ Random Forest Classification
111
Evaluating a Classification model
Ref: https://www.javatpoint.com/supervised-machine-learning
Log Loss or Cross-Entropy Loss:
▪ It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1.
▪ For a good binary classification model, the value of log loss should be near 0.
▪ The value of log loss increases as the predicted probability deviates from the actual label, so a lower log loss represents a more accurate model.
▪ For binary classification, the cross-entropy can be calculated as:
▪ Log Loss = -(1/N) Σᵢ [ yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ) ]
▪ Here, pᵢ is the predicted probability of class 1, and (1 - pᵢ) is the probability of class 0.
▪ When an observation belongs to class 1, the first part of the formula is active and the second part vanishes, and vice versa.
112
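▪ A quick sketch of the formula above (scikit-learn assumed; the labels and probabilities are illustrative):
# binary cross-entropy computed manually and via scikit-learn's log_loss
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])            # actual class labels
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(class = 1)

manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
print(manual)                    # ~0.26, near 0, so the classifier is good
print(log_loss(y_true, p_hat))   # scikit-learn returns the same value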
Confusion Matrix
Ref: https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
▪ The confusion matrix provides a matrix/table as output that describes the performance of the model.
▪ It is also known as the error matrix.
▪ The matrix summarizes the prediction results, giving the total numbers of correct and incorrect predictions.
113
Accuracy
Ref: https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
Accuracy simply measures how often the classifier makes the correct prediction. It’s the ratio
between the number of correct predictions and the total number of predictions.
114
Precision
Ref: https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
▪ Precision is a measure of the correctness of positive predictions. In simple words, it tells us how many of all the predicted positives are actually positive: Precision = TP / (TP + FP).
115
Recall
Ref: https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
▪ Recall is a measure of how many actual positive observations are predicted correctly, i.e., how many observations of the positive class are predicted as positive: Recall = TP / (TP + FN).
▪ It is also known as Sensitivity.
▪ Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
116
F-measure / F1-Score
Ref: https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
▪ The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall: F1 = 2 · (Precision · Recall) / (Precision + Recall). We use the harmonic mean because, unlike a simple average, it is not pulled up by one extremely large value.
117
Sensitivity & Specificity
Ref: https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
▪ Sensitivity is the recall of the positive class: Sensitivity = TP / (TP + FN). Specificity is the recall of the negative class: Specificity = TN / (TN + FP).
118
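▪ All of the metrics above can be read off a single confusion matrix; a minimal sketch (scikit-learn assumed; the label vectors are illustrative):
# accuracy, precision, recall/sensitivity, specificity, and F1 from one confusion matrix
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of the predicted positives, how many are real
recall = tp / (tp + fn)               # a.k.a. sensitivity
specificity = tn / (tn + fp)          # recall of the negative class
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, specificity, f1)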
Difference between Regression and Classification
Ref: https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
Regression Algorithm Classification Algorithm
▪ In Regression, the output variable must be of
continuous nature or real value.
▪ In Classification, the output variable must be a discrete
value.
▪ The task of the regression algorithm is to map the input
value (x) with the continuous output variable(y).
▪ The task of the classification algorithm is to map the
input value(x) with the discrete output variable(y).
▪ Regression Algorithms are used with continuous data. ▪ Classification Algorithms are used with discrete data.
▪ In Regression, we try to find the best fit line, which
can predict the output more accurately.
▪ In Classification, we try to find the decision boundary,
which can divide the dataset into different classes.
▪ Regression algorithms can be used to solve the
regression problems such as Weather Prediction,
House price prediction, etc.
▪ Classification Algorithms can be used to solve
classification problems such as Identification of spam
emails, Speech Recognition, Identification of
cancer cells, etc.
▪ The regression Algorithm can be further divided into
Linear and Non-linear Regression.
▪ The Classification algorithms can be divided into
Binary Classifier and Multi-class Classifier.
119
Ref: https://www.javatpoint.com/supervised-machine-learning
Linear Regression Logistic Regression
Linear regression is used to predict the continuous
dependent variable using a given set of
independent variables.
Logistic Regression is used to predict the
categorical dependent variable using a given set
of independent variables.
Linear Regression is used for solving Regression
problem.
Logistic regression is used for solving
Classification problems.
In Linear regression, we predict the value of
continuous variables.
In logistic Regression, we predict the values of
categorical variables.
In linear regression, we find the best fit line, by
which we can easily predict the output.
In Logistic Regression, we find the S-curve by
which we can classify the samples.
Least squares estimation is used to fit the linear
regression model (estimate its parameters).
Maximum likelihood estimation is used to fit the
logistic regression model.
The output for Linear Regression must be a
continuous value, such as price, age, etc.
The output of Logistic Regression must be a
Categorical value such as 0 or 1, Yes or No, etc.
In Linear regression, it is required that relationship
between dependent variable and independent
variable must be linear.
In Logistic regression, it is not required to have the
linear relationship between the dependent and
independent variable.
In linear regression, there may be collinearity
between the independent variables.
In logistic regression, there should not be
collinearity between the independent variables.
120
Linear Regression vs. Logistic Regression
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
121
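▪ A minimal side-by-side sketch of the two models (scikit-learn assumed; the one-feature data is illustrative): the same inputs are fitted with a best-fit line for a continuous target and with an S-curve for a categorical one.
# linear regression predicts a continuous value; logistic regression a class
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

x = np.arange(1, 11).reshape(-1, 1)     # a single independent variable
y_cont = 2.0 * x.ravel() + 1.0          # continuous target (regression)
y_cat = (x.ravel() > 5).astype(int)     # categorical target (classification)

lin = LinearRegression().fit(x, y_cont)     # least squares fit
log = LogisticRegression().fit(x, y_cat)    # maximum likelihood fit

print(lin.predict([[12]]))                  # a continuous value
print(log.predict([[12]]), log.predict_proba([[12]]))  # a class + probability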
Clustering in Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points; the objects with possible similarities remain in a group that has few or no similarities with another group."
▪ It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data according to the presence and absence of those patterns.
▪ It is an unsupervised learning method;
hence no supervision is provided to the
algorithm, and it deals with the unlabelled
dataset.
122
Clustering in Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ After applying the clustering technique, each cluster or group is given a cluster-ID. The ML system can use this ID to simplify the processing of large and complex datasets.
▪ The clustering technique can be widely
used in various tasks.
▪ Market Segmentation
▪ Statistical data analysis
▪ Social network analysis
▪ Image segmentation
▪ Anomaly detection, etc.
123
Types of Clustering Methods
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ The clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can belong to more than one group).
▪ Partitioning Clustering
▪ Density-Based Clustering
▪ Distribution Model-Based Clustering
▪ Hierarchical Clustering
▪ Fuzzy Clustering
124
Hierarchical Clustering in Machine Learning
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Hierarchical clustering is another unsupervised machine learning algorithm, used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis, or HCA.
▪ In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
▪ Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work: in hierarchical clustering there is no requirement to predetermine the number of clusters, as there is in the K-means algorithm.
▪ The hierarchical clustering technique has two approaches:
▪ Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts
with taking all data points as single clusters and merging them until one cluster is left.
▪ Divisive: Divisive algorithm is the reverse of the agglomerative algorithm as it is a top-
down approach.
125
Hierarchical Clustering
▪ The clusters formed in this method form a tree-type structure called dendrogram based on the
hierarchy[1]
▪ New clusters are formed using the previously formed one
▪ It is divided into two categories:
▪ Agglomerative clustering: a bottom-up approach
▪ Divisive clustering: top-down approach
▪ Examples are CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing
Clustering and using Hierarchies), etc.
▪ Agglomerative based dendrogram[2]:
Ref: [1] https://www.geeksforgeeks.org/clustering-in-machine-learning/
[2] https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019
126
Why hierarchical clustering?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ As we already have other clustering algorithms such as K-Means Clustering, why do we need hierarchical clustering?
▪ As we have seen, the K-means algorithm has some challenges: it needs a predetermined number of clusters, and it always tries to create clusters of the same size.
▪ To solve these two challenges we can opt for the hierarchical clustering algorithm, since in this algorithm we don't need to know the number of clusters in advance.
127
Agglomerative Hierarchical clustering
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ The agglomerative hierarchical clustering algorithm is a popular example of HCA.
▪ To group the datasets into clusters, it follows the bottom-up approach: the algorithm considers each data point as a single cluster at the beginning, and then starts combining the closest pairs of clusters.
▪ It does this until all the
clusters are merged into a
single cluster that contains all
the datasets.
▪ This hierarchy of clusters is
represented in the form of the
dendrogram.
128
How the Agglomerative Hierarchical clustering Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Step-1: Create each data point as a single cluster.
Let's say there are N data points, so the number of
clusters will also be N.
▪ Step-2: Take two closest data points or clusters
and merge them to form one cluster. So, there will
now be N-1 clusters.
▪ Step-3: Again, take the two closest clusters and
merge them together to form one cluster. There will
be N-2 clusters.
129
How the Agglomerative Hierarchical clustering Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Step-4: Repeat Step 3 until only one cluster is left. We will then get the following clusters.
▪ Step-5: Once all the clusters are combined into one big cluster,
develop the dendrogram to divide the clusters as per the problem.
130
Measure for the distance between two clusters
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ As we have seen, the distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate this distance, and the choice decides the rule for clustering.
▪ These measures are called linkage methods.
▪ Single Linkage: the shortest distance between the closest points of the two clusters.
131
Measure for the distance between two clusters
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Complete Linkage: the farthest distance between two points in two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.
▪ Average Linkage: the linkage method in which the distance between each pair of points (one from each cluster) is summed and then divided by the total number of pairs, giving the average distance between the two clusters.
▪ Centroid Linkage: the linkage method in which the distance between the centroids of the two clusters is calculated.
132
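▪ A short sketch of agglomerative clustering under the different linkage rules above (SciPy assumed; the points are illustrative):
# bottom-up clustering with different linkage rules
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)   # the merge history behind the dendrogram
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)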
Density-Based Clustering
▪ This method connects the highly-dense areas into clusters [1]
▪ These methods have good accuracy and the ability to merge two clusters [2]
▪ This type of clustering algorithm plays a crucial role in evaluating and finding non-linear shaped cluster structures based on density [3]
▪ The most popular density-based algorithm is DBSCAN, which performs spatial clustering of data with noise
▪ It makes use of two concepts – Data Reachability and Data Connectivity
Ref: [1] https://www.javatpoint.com/clustering-in-machine-learning, [4] https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html
[2] https://www.geeksforgeeks.org/clustering-in-machine-learning/, [3] https://www.geeksforgeeks.org/clustering-in-machine-learning/
▪ Density-based spatial clustering of applications with noise (DBSCAN):
▪ Based on the idea that a cluster in data space is a contiguous region of
high point density, separated from other such clusters by contiguous
regions of low point density [4]
▪ No need to explicitly define the number of clusters (K) like in K-Means
▪ The DBSCAN algorithm uses two parameters: 1) minPts: The minimum
number of points (a threshold) clustered together for a region to be
considered dense, 2) eps (ε): A distance measure that will be used to
locate the points in the neighborhood of any point
▪ There are three types of points after DBSCAN clustering is complete: 1) core points, 2) border points, 3) noise points
133
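▪ A minimal DBSCAN sketch (scikit-learn assumed; toy data), showing the two parameters described above; the label -1 marks noise points:
# DBSCAN needs no predefined K; eps and min_samples (minPts) control density
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # dense region 1
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # dense region 2
              [4.0, 15.0]])                         # an isolated point

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)                 # e.g. [0 0 0 1 1 1 -1]; -1 is a noise point
print(db.core_sample_indices_)    # indices of the core points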
Distribution Model-Based Clustering
▪ Here, the data is divided based on the probability that a data point belongs to a particular distribution [1]
▪ The grouping is done by assuming some distribution, most commonly the Gaussian distribution
▪ The data observed arises from a distribution consisting of a mixture of two or more cluster
components [2]
▪ Furthermore, each component cluster has a density function having an associated
probability or weight in this mixture
▪ An example of this type is the Expectation-Maximization (EM) clustering algorithm, which uses Gaussian Mixture Models (GMM) [1]
Ref: [1] https://www.javatpoint.com/clustering-in-machine-learning
[2] https://data-flair.training/blogs/clustering-in-machine-learning/
134
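▪ A short sketch of EM-based, distribution-model clustering (scikit-learn's GaussianMixture assumed; the data is synthetic): each point receives a soft membership probability for every Gaussian component.
# a Gaussian Mixture Model fitted by Expectation-Maximization
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # points around (0, 0)
               rng.normal(6, 1, (50, 2))])   # points around (6, 6)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)                # estimated component centres
print(gmm.predict(X[:3]))        # hard cluster assignments
print(gmm.predict_proba(X[:3]))  # soft membership probabilities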
Partition Clustering
▪ It is a type of clustering that divides the data into non-hierarchical groups [1]
▪ It is also known as the centroid-based method
▪ These methods partition the objects into k clusters and each partition forms one cluster[2]
▪ This method optimizes an objective criterion, such as a similarity function
▪ The most common example of partitioning clustering is the K-Means Clustering
algorithm [1]
Ref: [1] https://www.javatpoint.com/clustering-in-machine-learning
[2] https://www.geeksforgeeks.org/clustering-in-machine-learning/
▪ K-Means Clustering:
▪ It groups the unlabeled dataset into K clusters
▪ Main aim of this algorithm is to minimize the sum of
distances between the data point and their corresponding
clusters
▪ The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the best clusters are found
▪ It determines the best value for K center points or centroids
by an iterative process
135
Fuzzy Clustering
▪ Fuzzy clustering is a type of soft method in which a
data object may belong to more than one group or
cluster [1]
▪ Each data point has a set of membership coefficients, which reflect its degree of membership in each cluster
▪ Fuzzy C-means algorithm is the example of this type
of clustering; it is sometimes also known as the Fuzzy
k-means algorithm
Ref: [1] https://www.javatpoint.com/clustering-in-machine-learning
[2] https://2-bitbio.com/post/clustering-rnaseq-data-using-fuzzy-c-means-clustering/
▪ In the adjacent image, K-means clustering
produces output based on minimum distance
calculation and is an example of hard clustering[2]
▪ Fuzzy c-means perform soft clustering by giving a
membership coefficient to the data points
▪ Fuzzy clustering is used to solve multiclass or ambiguous clustering problems.
136
Applications of Clustering
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ In Identification of Cancer Cells: The clustering algorithms are widely used for
the identification of cancerous cells. It divides the cancerous and non-cancerous
data sets into different groups.
▪ In Search Engines: Search engines also work on the clustering technique. The search result appears based on the objects closest to the search query: similar data objects are grouped together, far from dissimilar objects. The accuracy of the results for a query depends on the quality of the clustering algorithm used.
▪ Customer Segmentation: It is used in market research to segment customers based on their choices and preferences.
▪ In Biology: It is used in the biology stream to classify different species of plants
and animals using the image recognition technique.
137
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ K-Means Clustering is an Unsupervised Learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process; e.g., if K=3 there will be three clusters, for K=4 four clusters, and so on.
▪ It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
▪ It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabelled dataset on its own, without the need for any training.
138
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the
data point and their corresponding clusters.
▪ The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats the process until the best clusters are found. The value of k must be predetermined in this algorithm.
▪ The k-means clustering algorithm mainly performs two
tasks:
▪ Determines the best value for K center points or
centroids by an iterative process.
▪ Assigns each data point to its closest k-center; the data points near a particular k-center form a cluster.
▪ Hence each cluster has data points with some commonalities and lies away from the other clusters.
139
How does the K-Means Algorithm Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Step-1: Select the number K to decide the number of clusters.
▪ Step-2: Select K random points or centroids (they need not come from the input dataset).
▪ Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
▪ Step-4: Calculate the variance and place a new centroid for each cluster.
▪ Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.
▪ Step-6: If any reassignment occurred, go to Step-4; else go to FINISH.
▪ Step-7: The model is ready.
140
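▪ The steps above map directly onto scikit-learn's KMeans; a minimal sketch (toy data assumed):
# K-Means: choose K, then iterate assign-to-closest-centroid / move-centroid
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster index of every point (Steps 3-6)
print(km.cluster_centers_)   # the final centroids after convergence
print(km.predict([[0, 0], [12, 3]]))  # assign new points to the clusters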
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Let's take number k of clusters, i.e., K=2, to identify the
dataset and to put them into different clusters. It means here
we will try to group these datasets into two different clusters.
We need to choose some random k points or centroid to
form the cluster. These points can be either the points from
the dataset or any other point. So, here we are selecting the
below two points as k points, which are not the part of our
dataset.
▪ Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this using the distance formula between two points, and then draw a median line between the two centroids (the perpendicular bisector of the segment joining them).
141
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Points on the left side of the line are nearer to the K1 (blue) centroid, and points to the right of the line are closer to the yellow centroid.
▪ Let's color them as blue and yellow for clear
visualization.
142
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ As we need to find the closest clusters, we repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of each cluster, which gives the new centroids shown below.
▪ Next, we will reassign each datapoint to the new
centroid. For this, we will repeat the same process
of finding a median line.
143
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ We can see that one yellow point is on the left side of the line and two blue points are to the right of the line, so these three points will be reassigned to new centroids.
▪ As reassignment has taken place, so
we will again go to the step-4, which is
finding new centroids or K-points.
▪ We repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the image.
144
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ With the new centroids, we again draw the median line and reassign the data points.
▪ As the image shows, there are no dissimilar data points on either side of the line, which means the model has converged.
145
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the image below.
146
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ How to choose the value of "K number of clusters" in K-means
Clustering?
▪ The performance of the K-means clustering algorithm depends on the efficiency of the clusters it forms, but choosing the optimal number of clusters is a big task.
▪ There are some different ways to find the optimal number of clusters, but here
we are discussing the most appropriate method to find the number of clusters
or value of K.
▪ Elbow Method
147
Elbow Method
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ The Elbow method is one of the most popular ways to find the optimal number of
clusters.
▪ This method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares, and it measures the total variation within the clusters. The formula to calculate WCSS (for 3 clusters) is:
▪ WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)²
▪ Here Σ_{Pi in Cluster1} distance(Pi, C1)² is the sum of the squared distances between each data point in Cluster 1 and its centroid C1, and likewise for the other two terms.
148
K-Means Clustering Algorithm
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.
▪ To find the optimal value of clusters, the elbow method follows the below steps:
▪ It executes the K-means clustering on a given dataset for different K values
(ranges from 1-10).
▪ For each value of K, calculates the WCSS value.
▪ Plots a curve between calculated WCSS values and the number of clusters K.
▪ The sharp point of the bend, where the plot looks like an arm, is considered the best value of K.
▪ Since the graph shows the sharp bend, which looks like
an elbow, hence it is known as the elbow method.
149
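▪ A sketch of the elbow method (scikit-learn assumed; a fitted KMeans exposes its WCSS as the inertia_ attribute, and the dataset here is synthetic):
# run K-Means for K = 1..10 and plot WCSS against K; look for the elbow
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)     # within-cluster sum of squares for this K

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('WCSS')
plt.show()                       # the sharp bend (elbow) suggests the best K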
Decision Tree Classification Algorithm
Ref: https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
▪ Decision Tree is a Supervised learning technique that
can be used for both classification and Regression
problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a
dataset, branches represent the decision rules and
each leaf node represents the outcome.
▪ In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
▪ The decisions or the test are performed on the basis of
features of the given dataset.
▪ It is a graphical representation for getting all the
possible solutions to a problem/decision based on
given conditions.
150
Decision Tree Classification Algorithm
Ref: https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
▪ It is called a decision tree because, similar to a
tree, it starts with the root node, which expands
on further branches and constructs a tree-like
structure.
▪ In order to build a tree, we use the CART
algorithm, which stands for Classification and
Regression Tree algorithm.
▪ A decision tree simply asks a question and, based on the answer (Yes/No), further splits into subtrees.
151
Why use Decision Trees?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Decision trees usually mimic the way humans think while making a decision, so they are easy to understand.
▪ The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
▪ Decision Tree Terminologies
▪ Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
▪ Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
▪ Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
▪ Branch/Sub Tree: A tree formed by splitting the tree.
▪ Pruning: Pruning is the process of removing the unwanted branches from the tree.
▪ Parent/Child node: The root node of the tree is called the parent node, and the other nodes are called child nodes.
152
How does the Decision Tree algorithm Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.
▪ For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further. It continues the process until it reaches the leaf node of the tree.
The complete process can be better understood using the below algorithm:
▪ Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
▪ Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
▪ Step-3: Divide S into subsets that contain the possible values of the best attribute.
▪ Step-4: Generate the decision tree node which contains the best attribute.
▪ Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are the leaf nodes.
153
How does the Decision Tree algorithm Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Example: Suppose there is a candidate who
has a job offer and wants to decide whether
he should accept the offer or Not.
▪ So, to solve this problem, the decision tree
starts with the root node (Salary attribute by
ASM).
▪ The root node splits further into the next
decision node (distance from the office) and
one leaf node based on the corresponding
labels.
▪ The next decision node further gets split into
one decision node (Cab facility) and one leaf
node. Finally, the decision node splits into two
leaf nodes (Accepted offers and Declined
offer).
154
How does the Decision Tree algorithm Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Attribute Selection Measures
▪ While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measure we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM:
▪ Information Gain
▪ Gini Index
155
How does the Decision Tree algorithm Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Information Gain:
▪ Information gain is the measurement of
changes in entropy after the segmentation of
a dataset based on an attribute.
▪ It calculates how much information a feature
provides us about a class.
▪ According to the value of information gain,
we split the node and build the decision tree.
▪ A decision tree algorithm always tries to
maximize the value of information gain, and a
node/attribute having the highest information
gain is split first. It can be calculated using
the below formula:
Information Gain= Entropy (S) - [(Weighted Avg) *Entropy(each feature)]
156
How does the Decision Tree algorithm Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
Entropy: Entropy is a metric to measure the
impurity in a given attribute. It specifies
randomness in data.
Entropy(S)= -P(yes)log2 P(yes)- P(no) log2 P(no)
Where,
•S= Total number of samples
•P (yes)= probability of yes
•P (no)= probability of no
157
How does the Decision Tree algorithm Work?
Ref: https://www.javatpoint.com/supervised-machine-learning
Gini Index:
•Gini index is a measure of impurity or purity used while creating a decision tree in
the CART (Classification and Regression Tree) algorithm.
•An attribute with the low Gini index should be preferred as compared to the high
Gini index.
•It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
•The Gini index can be calculated using the formula:
•Gini Index = 1 - Σⱼ (Pⱼ)², where Pⱼ is the proportion of samples of class j in the node.
158
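▪ A small sketch computing the two ASM quantities above for a toy split (NumPy assumed; the class counts are illustrative):
# entropy, information gain, and Gini index for a toy binary split
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    return 1.0 - (p ** 2).sum()

# parent node: 9 yes / 5 no; a feature splits it into (6 yes, 2 no) and (3 yes, 3 no)
parent, left, right = [9, 5], [6, 2], [3, 3]
weighted = (8 / 14) * entropy(left) + (6 / 14) * entropy(right)
print("Entropy(S)       =", entropy(parent))             # ~0.940
print("Information Gain =", entropy(parent) - weighted)  # ~0.048
print("Gini(left/right) =", gini(left), gini(right))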
Pruning: Getting an Optimal Decision tree
Ref: https://www.javatpoint.com/supervised-machine-learning, https://www.cs.cmu.edu/~bhiksha/courses/10-601/decisiontrees/
▪ Pruning is a process of deleting the unnecessary nodes from a tree in order to get
the optimal decision tree.
▪ A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset.
▪ Pruning is therefore the technique of decreasing the size of the learned tree without reducing accuracy.
▪ There are mainly two types of tree pruning technology used:
▪ Cost Complexity Pruning
▪ Reduced Error Pruning
159
Advantages/Disadvantages of the Decision Tree
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Advantages of the Decision Tree
▪ It is simple to understand, as it follows the same process a human follows while making a real-life decision.
▪ It can be very useful for solving decision-related problems.
▪ It helps to think about all the possible outcomes for a problem.
▪ It requires less data cleaning compared to other algorithms.
▪ Disadvantages of the Decision Tree
▪ The decision tree contains lots of layers, which makes it complex.
▪ It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
▪ For more class labels, the computational complexity of the decision tree may
increase.
162
Python Implementation of Decision Tree
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Data Pre-processing step
▪ Fitting a Decision-Tree algorithm to the Training set
▪ Predicting the test result
▪ Test accuracy of the result (Creation of Confusion matrix)
▪ Visualizing the test set result
163
Data Pre-Processing Step
Ref: https://www.javatpoint.com/supervised-machine-learning
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
164
Data Pre-Processing Step
Ref: https://www.javatpoint.com/supervised-machine-learning
165
Fitting a Decision-Tree algorithm to the Training set
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Now we will fit the model to the training set. For this, we will import the
DecisionTreeClassifier class from sklearn.tree library. Below is the code for
it:
#Fitting the Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
▪ In the above code, we have created a classifier object and passed two main parameters:
▪ criterion='entropy': the criterion used to measure the quality of a split, calculated by the information gain given by entropy.
▪ random_state=0: for reproducible random states.
166
Fitting a Decision-Tree algorithm to the Training set
Ref: https://www.javatpoint.com/supervised-machine-learning
Out[8]: DecisionTreeClassifier(class_weight =None, criterion='entropy',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False,
random_state=0, splitter='best')
168
Predicting the test result
Ref: https://www.javatpoint.com/supervised-machine-learning
Now we will predict the test set result by creating a new prediction vector y_pred. Below is the code for it:
#Predicting the test set result
y_pred = classifier.predict(x_test)
In the output image, the predicted output and the real test output are shown. We can clearly see that some values in the prediction vector differ from the real vector values; these are prediction errors.
169
Predicting the test result
Ref: https://www.javatpoint.com/supervised-machine-learning
170
Test accuracy of the result (Creation of Confusion matrix)
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ In the above output, we have seen that there were
some incorrect predictions, so if we want to know the
number of correct and incorrect predictions, we need
to use the confusion matrix. Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
▪ In the above output image, we can see the confusion
matrix, which has 6+3= 9 incorrect predictions and
62+29=91 correct predictions.
171
Visualizing the training set result:
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Here we will visualize the training set result by plotting a graph for the decision tree classifier. The classifier predicts Yes or No for the users who have either purchased or not purchased the SUV, as in the earlier classification examples. Below is the code for it:
▪ The above output is completely different from
the rest classification models. It has both
vertical and horizontal lines that are splitting the
dataset according to the age and estimated
salary variable.
▪ As we can see, the tree is trying to capture every data point, which is a case of overfitting.
172
Visualizing the training set result:
Ref: https://www.javatpoint.com/supervised-machine-learning
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
173
Visualizing the test set result:
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ Visualization of test set result will
be similar to the visualization of
the training set except that the
training set will be replaced with
the test set.
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
174
Visualizing the test set result:
Ref: https://www.javatpoint.com/supervised-machine-learning
▪ As we can see in the above image, there are some green data points within the purple region and vice versa.
▪ These are the incorrect predictions, which we discussed in the confusion matrix.
175
K-Nearest Neighbor (KNN) Algorithm
176
K-Nearest Neighbor (KNN) Algorithm
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the Supervised Learning technique.
▪ The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
▪ The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
▪ The K-NN algorithm can be used for regression as well as classification, but mostly it is used for classification problems.
▪ K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
▪ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
▪ At the training phase the KNN algorithm just stores the dataset, and when it gets new data it classifies that data into a category much similar to the new data.
▪ Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data most similar to those of the cat and dog images and, based on the most similar features, put it into either the cat or the dog category.
177
K-Nearest Neighbor (KNN) Algorithm
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
178
Why do we need a K-NN Algorithm?
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will the data point lie? To solve this type of problem, we need a K-NN algorithm.
▪ With the help of K-NN, we can easily identify the category or class of a particular data point.
179
How does K-NN work?
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Step-1: Select the number K of the
neighbors
▪ Step-2: Calculate the Euclidean distance
of K number of neighbors
▪ Step-3: Take the K nearest neighbors as
per the calculated Euclidean distance.
▪ Step-4: Among these k neighbors, count
the number of the data points in each
category.
▪ Step-5: Assign the new data points to
that category for which the number of the
neighbor is maximum.
180
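▪ The five steps above fit in a few lines of code; a minimal from-scratch sketch (NumPy assumed; the training points are illustrative):
# K-NN by hand: Euclidean distances, K closest points, majority vote
import numpy as np
from collections import Counter

X_train = np.array([[1, 1], [2, 1], [1, 2],    # category A
                    [8, 8], [9, 8], [8, 9]])   # category B
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

def knn_predict(x_new, k=5):
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Step-2
    nearest = np.argsort(dists)[:k]                        # Step-3
    votes = Counter(y_train[nearest])                      # Step-4
    return votes.most_common(1)[0][0]                      # Step-5

print(knn_predict(np.array([2, 2])))   # 'A': most of the 5 neighbours are A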
How does K-NN work?
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Suppose we have a new data point and
we need to put it in the required category.
Consider the image:
▪ Firstly, we will choose the number of
neighbors, so we will choose the k=5.
▪ Next, we will calculate the Euclidean
distance between the data points.
▪ The Euclidean distance is the distance between two points. It can be calculated as:
▪ d = √((x₂ − x₁)² + (y₂ − y₁)²)
181
How does K-NN work?
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ By calculating the Euclidean distance we
got the nearest neighbors, as three nearest
neighbors in category A and two nearest
neighbors in category B. Consider the
below image:
▪ As we can see, 3 of the 5 nearest neighbors are from category A, hence this new data point must belong to category A.
182
How to select the value of K in the K-NN Algorithm?
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Below are some points to remember while selecting the value of K in
the K-NN algorithm:
▪ There is no particular way to determine the best value for "K", so
we need to try some values to find the best out of them.
▪ The most preferred value for K is 5.
▪ A very low value for K such as K=1 or K=2, can be noisy and lead to
the effects of outliers in the model.
183
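▪ One common way to choose K in practice (a sketch, scikit-learn assumed; the dataset is synthetic) is to try several values and keep the one with the best cross-validated accuracy:
# try odd values of K and pick the one with the highest CV score
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k}: mean CV accuracy = {score:.3f}")
# very small K (1 or 2) is noisy and outlier-prone; K = 5 is a common default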
Advantages / Disadvantages of KNN Algorithm
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Advantages of KNN Algorithm
▪ It is simple to implement
▪ It is robust to the noisy training data
▪ It can be more effective if the training data is large.
▪ Disadvantages of KNN Algorithm
▪ We always need to determine the value of K, which can sometimes be complex.
▪ The computation cost is high because of calculating the distance
between the data points for all the training samples.
184
Python implementation of the KNN algorithm
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Problem statement: There is a Car
manufacturer company that has
manufactured a new SUV car.
▪ The company wants to show the ads to users who are interested in buying that SUV.
▪ For this problem, we have a dataset that contains multiple users' information gathered from a social network.
▪ The dataset contains a lot of information, but we will take Estimated Salary and Age as the independent variables and the Purchased variable as the dependent variable.
185
Steps to implement the K-NN algorithm
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Data Pre-processing step
▪ Fitting the K-NN algorithm to the Training set
▪ Predicting the test result
▪ Test accuracy of the result(Creation of Confusion matrix)
▪ Visualizing the test set result.
186
Data Pre-Processing Step
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
187
Data Pre-Processing Step
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ By executing the above code, our dataset is imported into our program and pre-processed. After feature scaling, our test dataset will look like the output shown.
▪ From the output image, we can see that our data has been successfully scaled.
188
Fitting K-NN classifier to the Training data
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Now we will fit the K-NN classifier to the training data.
▪ To do this we will import the KNeighborsClassifier class of the sklearn.neighbors library.
▪ After importing the class, we will create a Classifier object of the class with these parameters:
▪ n_neighbors: the required number of neighbors for the algorithm; usually it takes 5.
▪ metric='minkowski': the default parameter; it decides the distance measure between the points.
▪ p=2: with the Minkowski metric, this is equivalent to the standard Euclidean metric.
▪ And then we will fit the classifier to the training data.
#Fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
189
Python implementation of the KNN algorithm
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
Output: By executing the above code, we will get
the output as:
Out[10]:
KNeighborsClassifier
(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None,
n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
▪ Predicting the Test Result: to predict the test set result:
#Predicting the test set result
y_pred = classifier.predict(x_test)
190
Creating the Confusion Matrix
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Now we will create the Confusion Matrix for our
K-NN model to see the accuracy of the classifier.
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
▪ In the above code, we have imported the confusion_matrix function and stored its result in the variable cm.
▪ Output: By executing the above code, we will get
the matrix as shown in the image:
191
Confusion Matrix
Ref: https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5
▪ In the image, we can see
there are 64+29= 93 correct
predictions
▪ 3+4= 7 incorrect predictions
192
Visualizing the Training set result
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
193
Visualizing the Training set result
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ The above graph shows the output for the training data set.
▪ As we can see in the graph, the predicted output is quite good: most of the red points are in the red region and most of the green points are in the green region.
▪ However, there are a few green points in the red region and a few red points in the green region. These are the incorrect observations that we saw in the confusion matrix (7 incorrect outputs).
194
Support Vector Machine
195
Support Vector Machine
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Support Vector Machine or SVM is one of
the most popular Supervised Learning
algorithms, which is used for Classification
as well as Regression problems.
▪ However, primarily, it is used for
Classification problems in Machine
Learning.
▪ The goal of the SVM algorithm is to create
the best line or decision boundary that can
segregate n-dimensional space into
classes so that we can easily put the new
data point in the correct category in the
future.
196
Support Vector Machine
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ This best decision boundary is called a
hyperplane.
▪ SVM chooses the extreme
points/vectors that help in creating the
hyperplane.
▪ These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine.
▪ Consider the diagram in which there are
two different categories that are
classified using a decision boundary or
hyperplane:
197
Support Vector Machine
Ref: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Example: Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether the creature is a cat or a dog, such a model can be created using the SVM algorithm.
▪ We will first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature.
▪ So as support vector creates a decision
boundary between these two data (cat
and dog) and choose extreme cases
(support vectors), it will see the extreme
case of cat and dog. On the basis of the
support vectors, it will classify it as a cat.
Consider the below diagram:
198
Types of SVM
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
SVM can be of two types:
▪ Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used is called the
Linear SVM classifier.
▪ Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
which means that if a dataset cannot be classified by using a straight line, then
such data is termed non-linear data, and the classifier used is called the Non-linear
SVM classifier.
199
Hyperplane and Support Vectors in the SVM algorithm
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the
hyperplane of SVM.
▪ The dimension of the hyperplane depends on the number of features present in the
dataset: if there are 2 features (as shown in the image), the hyperplane will be a
straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
▪ We always create the hyperplane that has the maximum margin, which means the
maximum distance between the hyperplane and the nearest data points.
▪ The data points or vectors that are closest to the hyperplane and which affect
the position of the hyperplane are termed support vectors. Since these vectors
support the hyperplane, they are called support vectors.
200
How does SVM work?
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Linear SVM:
▪ The working of the SVM algorithm can be understood by using an example.
▪ Suppose we have a dataset that has two tags (green and blue), and the dataset
has two features x1 and x2.
▪ We want a classifier that can classify the pair (x1,
x2) of coordinates in either green or blue. Consider
the image:
▪ Since it is a 2-D space, we can easily separate these
two classes by using a straight line. But there
can be multiple lines that can separate these
classes. Consider the below image:
201
How does SVM work?
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Hence, the SVM algorithm helps to find the best
line or decision boundary; this best boundary or
region is called a hyperplane.
▪ The SVM algorithm finds the closest points of the lines
from both classes. These points are called
support vectors.
▪ The distance between the vectors and the
hyperplane is called the margin.
▪ The goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called
the optimal hyperplane.
202
How does SVM work?
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Non-Linear SVM:
▪ If data is linearly arranged, then we can separate it
by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the
below image:
▪ So to separate these data points, we need to add
one more dimension.
▪ For linear data, we have used two dimensions x and
y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
203
How does SVM work?
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ By adding the third dimension, the
sample space will become as below
image:
▪ So now, SVM will divide the datasets into
classes in the following way.
z = x² + y²
204
How does SVM work?
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Since we are in 3-D space, it looks like a
plane parallel to the x-axis.
▪ If we convert it back to 2-D space with z = 1, then
it becomes:
z = x² + y²
205
How does SVM work?
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Hence, we get a circle of radius 1 in the case of non-linear data.
z = x² + y²
206
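▪ The lifting trick above can be sketched in a few lines of NumPy. This is an illustrative example (the random data and the unit-circle threshold are assumptions, not the slide's dataset):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))       # 2-D points (x, y) around the origin
z = x[:, 0]**2 + x[:, 1]**2         # third dimension z = x² + y²
inside = z < 1.0                    # labels: inside vs. outside the unit circle
# In (x, y, z) space the two groups are separated by the plane z = 1,
# which maps back to the circle x² + y² = 1 in the original 2-D space.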
Data Pre-processing step
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set = pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
207
Data Pre-processing step
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ After executing the above code,
we will pre-process the data. The
code will give the dataset as:
208
Data Pre-processing step
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
The scaled output for the test set will be:
209
Fitting the SVM classifier to the training set
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Now the training set will be fitted to the SVM classifier.
▪ To create the SVM classifier, we will import the SVC class from the sklearn.svm library.
▪ Below is the code for it:
▪ from sklearn.svm import SVC # "Support vector classifier"
▪ classifier = SVC(kernel='linear', random_state=0)
▪ classifier.fit(x_train, y_train)
▪ In the above code, we have used kernel='linear', as here we are creating an SVM
for linearly separable data; however, we can change it for non-linear data, as
shown in the sketch below. We have then fitted the classifier to the training dataset (x_train, y_train).
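▪ For example, a minimal sketch of switching to a non-linear (RBF) kernel on the same pre-processed data:

# Assumes x_train and y_train from the pre-processing step above
classifier = SVC(kernel='rbf', gamma='scale', random_state=0)
classifier.fit(x_train, y_train)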
210
Output
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
Out[8]:
SVC (C=1.0,
cache_size=200,
class_weight=None,
coef0=0.0,
decision_function_shape='ovr',
degree=3,
gamma='auto_deprecated',
kernel='linear',
max_iter=-1,
probability=False,
random_state=0,
shrinking=True,
tol=0.001,
verbose=False)
▪ The model performance can be altered
by changing the value of
▪ C (Regularization factor),
▪ gamma,
▪ kernel.
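▪ A minimal sketch of tuning these hyperparameters with a grid search (the candidate values below are illustrative assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10],
              'gamma': [0.01, 0.1, 1],
              'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(random_state=0), param_grid, cv=5)  # 5-fold cross-validation
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)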
211
Predicting the test set result
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Now, we will predict the output for the test set.
For this, we will create a new vector y_pred.
Below is the code for it:
▪ #Predicting the test set result
▪ y_pred = classifier.predict(x_test)
▪ After getting the y_pred vector, we can
compare the result of y_pred and y_test to
check the difference between the actual
value and predicted value.
▪ Output: Image is the output for the
prediction of the test set:
212
Creating the confusion matrix
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Now we will examine the performance of the SVM
classifier, i.e., how many incorrect predictions
there are.
▪ To create the confusion matrix, we need to import
the confusion_matrix function of the sklearn library.
▪ After importing the function, we will call it and store the result in a new
variable cm.
▪ The function mainly takes two parameters: y_true
(the actual values) and y_pred (the predicted values
returned by the classifier).
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
213
Creating the confusion matrix
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ As we can see in the output
image, there are 66+24 = 90
correct predictions and 8+2 = 10
incorrect predictions.
214
Visualizing the training set result
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
215
Visualizing the training set result
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Output
▪ As we can see, the above output appears
similar to the Logistic Regression output.
▪ In the output, we got a straight line as the
hyperplane because we have used a linear
kernel in the classifier.
▪ We have also discussed above that for 2-D
space, the hyperplane in SVM is a straight
line.
216
Visualizing the test set result
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
217
Visualizing the test set result
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ As we can see in the above output
image, the SVM classifier has divided the
users into two regions (Purchased or Not
purchased).
▪ Users who purchased the SUV are in the
red region with the red scatter points.
▪ Users who did not purchase the SUV are
in the green region with green scatter
points.
▪ The hyperplane has divided the users into
the two classes, Purchased and Not
purchased.
218
Semi-Supervised Learning
219
Semi-Supervised Learning
RJEs: Remote job entry points
▪ Closely related to transductive learning
▪ Uses both labeled and unlabeled data to perform an otherwise supervised learning or unsupervised
learning task
▪ Initially motivated by its practical value in learning faster, better, and cheaper
▪ Has applications in cognitive psychology as a computational model for human learning
▪ It typically combines a small amount of labeled data with a much larger amount of unlabeled
data
▪ Some of the applications are text classification, iterative co-training-based applications such as
webpage classification, lane finding on GPS data, etc.
Ref: https://pages.cs.wisc.edu/~jerryzhu/pub/SSL_EoML.pdf
220
Algorithm Flow
RJEs: Remote job entry points
▪ Semi-Supervised learning Algorithm Flow
Ref: https://www.cs.cmu.edu/~ninamf/courses/401sp18/lectures/ssl-04-18.pdf
▪ The models based on this are semi-supervised SVM, graph-based models, generative models, etc.
221
Major Kernel Functions in Support
Vector Machine
222
Major Kernel Functions in Support Vector Machine
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
What is Kernel Method?
▪ Kernel methods are a set of techniques used in machine learning to address
classification, regression, and other prediction problems. They are built around the idea of kernels,
which are functions that gauge how similar two data points are to one another in a high-
dimensional feature space.
▪ The fundamental premise of kernel methods is to convert the input data into a high-dimensional
feature space, which makes it simpler to distinguish between classes or generate predictions.
Kernel methods employ a kernel function to implicitly map the data into the feature space, as
opposed to manually computing the feature space.
▪ The most popular kind of kernel approach is the Support Vector Machine (SVM), a binary
classifier that determines the best hyperplane that most effectively divides the two groups. In
order to efficiently locate the ideal hyperplane, SVMs map the input into a higher-dimensional
space using a kernel function.
▪ Other examples of kernel methods include kernel ridge regression, kernel PCA, and Gaussian
processes. Since they are strong, adaptable, and computationally efficient, kernel approaches
are frequently employed in machine learning. They are resilient to noise and outliers and can
handle sophisticated data structures like strings and graphs.
223
Kernel Method in SVMs
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Support Vector Machines (SVMs) use kernel methods to transform the input data into a higher-
dimensional feature space, which makes it simpler to distinguish between classes or generate
predictions.
▪ Kernel approaches in SVMs work on the fundamental principle of implicitly mapping input data into
a higher-dimensional feature space without directly computing the coordinates of the data points in
that space.
▪ The kernel function in SVMs is essential in determining the decision boundary that divides the
various classes.
▪ In order to calculate the degree of similarity between any two points in the feature space, the
kernel function computes their dot product.
▪ The most commonly used kernel function in SVMs is the Gaussian or radial basis function (RBF)
kernel. The RBF kernel maps the input data into an infinite-dimensional feature space using a
Gaussian function. This kernel function is popular because it can capture complex nonlinear
relationships in the data.
224
Kernel Method in SVMs
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Other types of kernel functions that can be used in SVMs include the polynomial kernel, the
sigmoid kernel, and the Laplacian kernel. The choice of kernel function depends on the specific
problem and the characteristics of the data.
▪ Basically, kernel methods in SVMs are a powerful technique for solving classification and
regression problems, and they are widely used in machine learning because they can handle
complex data structures and are robust to noise and outliers.
225
Characteristics of Kernel Function
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Mercer's condition: A kernel function must satisfy Mercer's condition to be valid.
This condition ensures that the kernel function is positive semidefinite, meaning
that the kernel (Gram) matrix it induces on any finite set of points has no negative eigenvalues.
▪ Positive definiteness: A kernel function is positive definite if the kernel matrix it
induces is strictly positive definite whenever the inputs are distinct.
▪ Non-negativity: Many commonly used kernel functions (such as the RBF kernel) are non-
negative, producing non-negative values for all inputs.
▪ Symmetry: A kernel function is symmetric, i.e., K(x, y) = K(y, x): it produces the same
value regardless of the order in which the inputs are given.
226
Characteristics of Kernel Function
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Reproducing property: A kernel function satisfies the reproducing property if it
can be used to reconstruct the input data in the feature space.
▪ Smoothness: A kernel function is said to be smooth if it produces a smooth
transformation of the input data into the feature space.
▪ Complexity: The complexity of a kernel function is an important consideration,
as more complex kernel functions may lead to overfitting and reduced
generalization performance.
227
Selecting an appropriate kernel function
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ Basically, the choice of kernel function depends on the specific problem and the characteristics of
the data, and selecting an appropriate kernel function can significantly impact the performance of
machine learning algorithms.
▪ Major Kernel Function in Support Vector Machine
▪ In Support Vector Machines (SVMs), there are several types of kernel functions that can be used
to map the input data into a higher-dimensional feature space. The choice of kernel function
depends on the specific problem and the characteristics of the data.
228
Linear Kernel
RJEs: Remote job entry points https://www.javatpoint.com/
▪ A linear kernel is a type of kernel function used in machine learning, including in SVMs (Support
Vector Machines). It is the simplest and most commonly used kernel function, and it defines the dot
product between the input vectors in the original feature space.
▪ The linear kernel can be defined as:
K(x, y) = x · y
▪ Where x and y are the input feature vectors.
▪ The dot product of the input vectors is a measure of their similarity or distance in the original feature
space.
▪ When using a linear kernel in an SVM, the decision boundary is a linear hyperplane that separates
the different classes in the feature space.
▪ This linear boundary can be useful when the data is already separable by a linear decision boundary
or when dealing with high-dimensional data, where the use of more complex kernel functions may
lead to overfitting. 229
Polynomial Kernel
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ It is a nonlinear kernel function that employs polynomial functions to transfer the input data into a
higher-dimensional feature space.
▪ One definition of the polynomial kernel is:
K(x, y) = (x · y + c)^d
▪ Where x and y are the input feature vectors, c is a constant term, and d is the degree of the
polynomial.
▪ The constant term is added to the dot product of the input vectors, and the sum is raised to
the degree of the polynomial.
▪ The decision boundary of an SVM with a polynomial kernel might capture more intricate
correlations between the input characteristics because it is a nonlinear hyperplane.
▪ The degree of nonlinearity in the decision boundary is determined by the degree of the polynomial.
230
Polynomial Kernel
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ The polynomial kernel has the benefit of being able to detect both linear and nonlinear correlations in the data.
▪ It can be difficult to select the proper degree of the polynomial, though, as a larger degree can result in overfitting,
while a lower degree may not adequately represent the underlying relationships in the data.
▪ In general, the polynomial kernel is an effective tool for converting the input data into a higher-dimensional feature
space in order to capture nonlinear correlations between the input characteristics.
Gaussian (RBF) Kernel
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a popular kernel function used in machine
learning, particularly in SVMs (Support Vector Machines). It is a nonlinear kernel function that maps the input data into a
higher-dimensional feature space using a Gaussian function.
The Gaussian kernel can be defined as:
K(x, y) = exp(-gamma * ||x - y||^2)
▪ Where x and y are the input feature vectors, gamma is a parameter that controls the width of the
Gaussian function, and ||x - y|| is the Euclidean distance between the input vectors.
231
Gaussian (RBF) Kernel
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
K(x, y) = exp(-gamma * ||x - y||^2)
▪ One advantage of the Gaussian kernel is its ability to capture complex relationships in the data
without the need for explicit feature engineering.
▪ However, the choice of the gamma parameter can be challenging, as a smaller value may result in
underfitting, while a larger value may result in overfitting.
232
Laplace Kernel
RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
▪ The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type of kernel function used in
machine learning, including in SVMs (Support Vector Machines). It is a non-parametric kernel that can be used to
measure the similarity or distance between two input feature vectors.
▪ The Laplacian kernel can be defined as:
K(x, y) = exp(-gamma * ||x - y||)
▪ Where x and y are the input feature vectors, gamma is a parameter that controls the width of the Laplacian function, and
||x - y|| is the L1 norm or Manhattan distance between the input vectors.
▪ When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear hyperplane that can capture complex
relationships between the input features. The width of the Laplacian function, controlled by the gamma parameter,
determines the degree of nonlinearity in the decision boundary.
▪ One advantage of the Laplacian kernel is its robustness to outliers, as it places less weight on large distances between the
input vectors than the Gaussian kernel. However, like the Gaussian kernel, choosing the correct value of the gamma
parameter can be challenging.
233
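▪ For concreteness, the four kernels above can be written directly in NumPy. This is a minimal sketch with illustrative parameter values (the choices of c, d, and gamma are assumptions):

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                            # K(x, y) = x · y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c) ** d                 # K(x, y) = (x · y + c)^d

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # K(x, y) = exp(-gamma ||x - y||^2)

def laplacian_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum(np.abs(x - y)))  # K(x, y) = exp(-gamma ||x - y||_1)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
for k in (linear_kernel, polynomial_kernel, rbf_kernel, laplacian_kernel):
    print(k.__name__, k(x, y))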
Reinforcement Learning
234
Reinforcement Learning
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
237
▪ What is Reinforcement Learning?
▪ Terms used in Reinforcement Learning.
▪ Key features of Reinforcement Learning.
▪ Elements of Reinforcement Learning.
▪ Approaches to implementing Reinforcement Learning.
▪ How does Reinforcement Learning Work?
▪ The Bellman Equation.
▪ Types of Reinforcement Learning.
▪ Reinforcement Learning Algorithm.
▪ Markov Decision Process.
▪ What is Q-Learning?
▪ Difference between Supervised Learning and Reinforcement Learning.
▪ Applications of Reinforcement Learning.
▪ Conclusion.
Reinforcement Learning Tutorial
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
238
▪ Reinforcement Learning is a feedback-based
machine learning technique in which an agent
learns to behave in an environment by
performing actions and seeing the results
of those actions. For each good action, the agent
gets positive feedback, and for each bad
action, the agent gets negative feedback or a
penalty.
▪ In Reinforcement Learning, the agent learns
automatically using feedback, without any
labeled data, unlike supervised learning.
▪ Since there is no labeled data, the agent is
bound to learn from its experience only.
Reinforcement Learning Tutorial
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
239
▪ RL solves a specific type of problem where
decision making is sequential, and the goal is
long-term, such as game-playing, robotics,
etc.
▪ The agent interacts with the environment and
explores it by itself.
▪ The primary goal of an agent in reinforcement
learning is to improve its performance by
collecting the maximum positive rewards.
Reinforcement Learning Tutorial
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
240
▪ The agent learns through trial and error, and based on the experience, it
learns to perform the task better. Hence, we can say that "Reinforcement learning is a
type of machine learning method where an intelligent agent (computer
program) interacts with the environment and learns to act within it."
▪ It is a core part of Artificial Intelligence, and all AI
agents work on the concept of reinforcement learning.
Here we do not need to pre-program the agent, as it
learns from its own experience without any human
intervention.
▪ Example: Suppose there is an AI agent present within a
maze environment, and its goal is to find the diamond.
The agent interacts with the environment by performing
some actions, and based on those actions, the state of
the agent changes, and it also receives a reward or
penalty as feedback.
Reinforcement Learning Tutorial
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
241
▪ The agent continues doing these three
things (take action, change state/remain
in the same state, and get feedback),
and by doing these actions, it learns and
explores the environment.
▪ The agent learns which actions lead to
positive feedback or rewards and which
actions lead to negative feedback or penalty.
For a positive reward, the agent gets a
positive point, and as a penalty, it gets a
negative point.
Terms used in Reinforcement Learning
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
242
▪ Agent: An entity that can perceive/explore the environment and act upon it.
▪ Environment: A situation in which an agent is present or by which it is surrounded. In RL, we assume a
stochastic environment, which means it is random in nature.
▪ Action: Actions are the moves taken by an agent within the environment.
▪ State: The situation returned by the environment after each action taken by the agent.
▪ Reward: Feedback returned to the agent from the environment to evaluate the action.
▪ Policy: A strategy applied by the agent to decide the next action based on the current state.
▪ Value: The expected long-term return with the discount factor, as opposed to the short-term
reward.
▪ Q-value: Mostly similar to the value, but it takes one additional parameter, the current action.
Key Features of Reinforcement Learning
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
243
▪ In RL, the agent is not instructed about the
environment or what actions need to be taken.
▪ It is based on a trial-and-error process.
▪ The agent takes the next action and changes
states according to the feedback of the previous
action.
▪ The agent may get a delayed reward.
▪ The environment is stochastic, and the agent
needs to explore it to collect the maximum
positive rewards.
Approaches to implement Reinforcement Learning
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
244
▪ Value-based:
The value-based approach is about finding the optimal value function, which gives the maximum value
at a state under any policy. Here, the agent expects the long-term return at any state(s) under
policy π.
▪ Policy-based:
The policy-based approach is to find the optimal policy for the maximum future rewards without using
the value function. In this approach, the agent tries to apply a policy such that the action
performed in each step helps to maximize the future reward.
▪ The policy-based approach has mainly two types of policy:
▪ Deterministic: The same action is produced by the policy (π) at any given state.
▪ Stochastic: In this policy, probability determines the produced action.
▪ Model-based: In the model-based approach, a virtual model is created for the environment, and
the agent explores that environment to learn it. There is no particular solution or algorithm for this
approach because the model representation is different for each environment.
Elements of Reinforcement Learning
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
245
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
Policy
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
246
▪ A policy can be defined as the way an agent behaves at a given time.
▪ It maps the perceived states of the environment to the actions taken on those
states.
▪ A policy is the core element of the RL as it alone can define the behavior of the
agent.
▪ In some cases, it may be a simple function or a lookup table, whereas, for other
cases, it may involve general computation as a search process.
▪ It could be a deterministic or a stochastic policy:
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]
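▪ A minimal sketch of the two policy types in Python (the states, actions, and probabilities below are illustrative assumptions):

import numpy as np

# Deterministic policy: a = π(s), a fixed action for each state
def deterministic_policy(state):
    return {'s1': 'up', 's2': 'right'}[state]

# Stochastic policy: π(a | s) = P[A_t = a | S_t = s]
rng = np.random.default_rng(0)
def stochastic_policy(state):
    actions, probs = ['up', 'right'], [0.7, 0.3]
    return rng.choice(actions, p=probs)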
Reward Signal
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
247
▪ The goal of reinforcement learning is defined by the
reward signal.
▪ At each state, the environment sends an immediate
signal to the learning agent, and this signal is known as
a reward signal.
▪ These rewards are given according to the good and bad
actions taken by the agent.
▪ The agent's main objective is to maximize the total
number of rewards for good actions.
▪ The reward signal can change the policy, such as if an
action selected by the agent leads to low reward, then
the policy may change to select other actions in the
future.
Value Function
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
248
▪ The value function gives information about
how good the situation and action are and how
much reward an agent can expect.
▪ A reward indicates the immediate signal for
each good and bad action, whereas a value
function specifies the good state and action
for the future.
▪ The value function depends on the reward
as, without reward, there could be no value.
The goal of estimating values is to achieve
more rewards.
Model
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
249
▪ The model mimics the behavior of the environment.
With the help of the model, one can make
inferences about how the environment will behave.
For example, if a state and an action are given, the
model can predict the next state and reward.
▪ The model is used for planning, which means it
provides a way to decide a course of action by
considering all future situations before actually
experiencing those situations. The approaches for
solving RL problems with the help of a
model are termed model-based approaches.
Comparatively, an approach without using a
model is called a model-free approach.
How does Reinforcement Learning Work?
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
250
▪ To understand the working process
of the RL, we need to consider two
main things:
▪ Environment: It can be anything
such as a room, maze, football
ground, etc.
▪ Agent: An intelligent agent, such as an
AI robot. Let's take an example of a
maze environment that the agent
needs to explore.
How does Reinforcement Learning Work?
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
251
▪ In the above image, the agent is at the very first
block of the maze. The maze consists of an
S6 block, which is a wall, S8, a fire pit, and
S4, a diamond block.
▪ The agent cannot cross the S6 block, as it is a solid
wall. If the agent reaches the S4 block, it gets
the +1 reward; if it reaches the fire pit, it gets a -1
reward point. It can take four actions: move up,
move down, move left, and move right.
▪ The agent can take any path to reach the final
point, but it needs to do so in as few
steps as possible. Suppose the agent follows the path S9-S5-
S1-S2-S3; then it will get the +1 reward point.
▪ The agent will try to remember the preceding steps
it has taken to reach the final step. To memorize
the steps, it assigns a value of 1 to each previous step.
How does Reinforcement Learning Work?
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
252
▪ Now, the agent has successfully stored the
previous steps by assigning the value 1 to each
previous block.
▪ But what will the agent do if it starts moving
from a block that has a value-1 block on both
sides?
▪ It will be a difficult condition for the agent whether
it should go up or down, as each block has the
same value. So, the above approach is not suitable
for the agent to reach the destination. Hence, to
solve the problem, we will use the Bellman
equation, which is the main concept behind
reinforcement learning.
The Bellman Equation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
253
▪ The Bellman equation was introduced by the mathematician Richard Ernest
Bellman in the year 1953, and hence it is called the Bellman equation. It is
associated with dynamic programming and is used to calculate the value of a
decision problem at a certain point by including the values of successor states.
▪ It is a way of calculating the value functions in dynamic programming, which
leads to modern reinforcement learning.
▪ The key elements used in the Bellman equation are:
▪ The action performed by the agent is referred to as "a"
▪ The state occurring by performing the action is "s"
▪ The reward/feedback obtained for each good and bad action is "R"
▪ The discount factor is Gamma "γ"
▪ The Bellman equation can be written as: V(s) = max_a [R(s,a) + γV(s')]
The Bellman Equation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
254
▪ The Bellman equation can be written as:
V(s) = max_a [R(s,a) + γV(s')]
Where,
▪ V(s) = value calculated at a particular state.
▪ R(s,a) = reward obtained at state s by performing action a.
▪ γ = discount factor
▪ V(s') = the value of the next state.
▪ In the above equation, we take the max over the possible actions because the
agent always tries to find the optimal solution.
▪ So now, using the Bellman equation, we will find the value at each state of the given
environment. We will start from the block which is next to the target block.
The Bellman Equation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
255
▪ For the 1st block:
▪ V(s3) = max [R(s,a) + γV(s')],
▪ here V(s') = 0,
▪ because there is no further state to move to.
▪ V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
The Bellman Equation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
256
▪ For the 2nd block:
▪ V(s2) = max [R(s,a) + γV(s')],
▪ here γ = 0.9 (say), V(s') = 1, and R(s, a) = 0,
▪ because there is no reward at this state.
▪ V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
The Bellman Equation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
257
▪ For the 3rd block:
▪ V(s1) = max [R(s,a) + γV(s')],
▪ here γ = 0.9 (say),
▪ V(s') = 0.9, and R(s, a) = 0, because there is no
reward at this state either.
▪ V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
The Bellman Equation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
258
▪ For the 4th block:
▪ V(s5) = max [R(s,a) + γV(s')],
▪ here γ = 0.9 (say),
▪ V(s') = 0.81, and
▪ R(s, a) = 0, because there is no reward at this state either.
▪ V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73
The Bellman Equation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
259
▪ For the 5th block:
▪ V(s9) = max [R(s,a) + γV(s')],
▪ here γ = 0.9 (say),
▪ V(s') = 0.73, and R(s, a) = 0,
▪ because there is no reward at this state either.
▪ V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
The Bellman Equation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
260
▪ Now, the agent has three options for moving:
▪ if it moves to the blue box, it will feel a
bump; if it moves to the fire pit, it will
get the -1 reward.
▪ But here we are considering only positive rewards,
so it will move upwards only.
▪ The complete block values will be calculated
using this formula.
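▪ The hand computation above can be reproduced with a short value-iteration sketch. This assumes a single path S9-S5-S1-S2-S3 with γ = 0.9 and a +1 reward on entering the goal; it is illustrative, not the full maze:

gamma = 0.9
next_state = {'s9': 's5', 's5': 's1', 's1': 's2', 's2': 's3', 's3': None}
V = {s: 0.0 for s in next_state}

for _ in range(10):                        # iterate until the values settle
    for s in V:
        if next_state[s] is None:          # s3 is next to the goal: R = 1
            V[s] = 1.0
        else:                              # no reward, only the discounted next value
            V[s] = gamma * V[next_state[s]]

print(V)   # {'s9': 0.6561, 's5': 0.729, 's1': 0.81, 's2': 0.9, 's3': 1.0}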
Types of Reinforcement learning
261
Types of Reinforcement learning
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
262
▪ There are mainly two types of reinforcement learning, which are:
▪ Positive Reinforcement
▪ Negative Reinforcement
Positive Reinforcement
RJEs: Remote job entry points
https://www.javatpoint.com/reinforcement-learning, https://www.verywellmind.com/what-is-positive-reinforcement-2795412
263
▪ Positive reinforcement learning means adding something to increase the
tendency that the expected behavior will occur again. It impacts the
behavior of the agent positively and increases the strength of the behavior.
▪ This type of reinforcement can sustain the changes for a long time, but too much
positive reinforcement may lead to an overload of states, which can diminish the
results.
Negative Reinforcement
RJEs: Remote job entry points
264
▪ Negative reinforcement
learning is the opposite of
positive reinforcement, as it
increases the tendency that the
specific behavior will occur again
by avoiding the negative
condition.
▪ It can be more effective than
positive reinforcement, depending
on the situation and behavior, but it
provides reinforcement only to
meet the minimum required behavior.
https://www.javatpoint.com/reinforcement-learning, https://www.parentingforbrain.com/negative-reinforcement/
Markov Decision Process (MDP)
265
How to represent the agent state?
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
266
▪ We can represent the agent state using the Markov state, which contains all the
required information from the history.
▪ The state St is a Markov state if it satisfies the condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
▪ The Markov state follows the Markov property, which says that the future is
independent of the past and can be defined using only the present.
▪ RL works on fully observable environments, where the agent can observe
the environment and act to reach the new state. The complete process is known as the
Markov Decision Process.
Markov Decision Process
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
267
▪ Markov Decision Process, or MDP, is used
to formalize reinforcement learning
problems. If the environment is completely
observable, then its dynamics can be
modeled as a Markov Process.
▪ In MDP, the agent constantly interacts
with the environment and performs
actions; at each action, the environment
responds and generates a new state.
▪ MDP is used to describe the
environment for the RL, and almost all the
RL problem can be formalized using MDP.
Markov Decision Process
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
268
▪ MDP contains a tuple of four elements
(S, A, Pa, Ra):
▪ A set of finite states S
▪ A set of finite actions A
▪ Ra(s, s') - the reward received after transitioning
from state s to state s' due to action a
▪ Pa(s, s') - the probability that action a in state s
leads to state s'
▪ MDP uses the Markov property
Markov Property
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
269
▪ It says that "if the agent is present in the current
state s1, performs an action a1, and moves to the
state s2, then the state transition from s1 to s2
depends only on the current state; the future actions
and states do not depend on past actions,
rewards, or states."
▪ Or, in other words, as per the Markov property, the
current state transition does not depend on any
past action or state.
▪ Hence, an MDP is an RL problem that satisfies the
Markov property. For example, in a chess game, the
players only focus on the current state and do
not need to remember past actions or states.
Finite MDP
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
270
▪ A finite MDP is when there are finite states, finite rewards, and finite actions.
▪ In RL, we consider only the finite MDP.
Markov Process
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
271
▪ A Markov process is a memoryless process with a
sequence of random states S1, S2, ..., St that
satisfies the Markov property.
▪ A Markov process is also known as a Markov chain,
which is a tuple (S, P) of a state set S and a transition
function P.
▪ These two components (S and P) can define the
dynamics of the system.
Q-Learning
272
Q-Learning:
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
273
▪ Q-learning is an off-policy RL algorithm,
which is used for temporal difference
learning.
▪ Temporal difference learning methods are
ways of comparing temporally successive
predictions.
▪ It learns the value function Q(s, a), which
tells how good it is to take action "a" at a
particular state "s".
▪ The below flowchart explains the working of Q-
learning.
State Action Reward State Action (SARSA)
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
274
▪ SARSA stands for State Action Reward State Action, which is an on-policy temporal difference
learning method. The on-policy control method selects the action for each state while learning
using a specific policy.
▪ The goal of SARSA is to calculate Qπ(s, a) for the selected current policy π and all
pairs of (s, a).
▪ The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, the
maximum reward for the next state is not required for updating the Q-value in the table.
▪ In SARSA, the new action and reward are selected using the same policy that
determined the original action.
▪ SARSA is so named because it uses the quintuple Q(s, a, r, s', a'), where:
s: original state
a: original action
r: reward observed while following the states
s' and a': new state-action pair
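▪ The contrast between the two update rules can be written in a few lines. A toy sketch with hypothetical values (α, γ, the table size, and the transition are assumptions):

import numpy as np

alpha, gamma = 0.1, 0.9
Q = np.zeros((5, 4))                 # Q-table: 5 states x 4 actions, initialized to zero
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 3

# Q-learning (off-policy): bootstrap with the best action in the next state
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# SARSA (on-policy): bootstrap with the action a' actually chosen by the current policy
Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])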
Deep Q Neural Network (DQN)
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
275
▪ As the name suggests, DQN is Q-learning using neural networks.
▪ For a big state-space environment, it is a challenging and complex task to
define and update a Q-table.
▪ To solve such an issue, we can use a DQN algorithm, where, instead of defining a
Q-table, a neural network approximates the Q-values for each action and state.
Q-Learning Explanation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
276
▪ Q-learning is a popular model-free reinforcement learning algorithm based on the
Bellman equation.
▪ The main objective of Q-learning is to learn the policy which can inform the
agent what actions should be taken to maximize the reward under which
circumstances.
▪ It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
▪ The goal of the agent in Q-learning is to maximize the value of Q.
▪ The value of Q-learning can be derived from the Bellman equation. Consider the
Bellman equation given below:
Q-Learning Explanation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
277
▪ In the equation, we have various components, including the reward, the discount factor (γ),
probability, and the end state s'.
▪ But no Q-value is given yet.
▪ In the image, we can see there is an agent who has
three value options, V(s1), V(s2), V(s3). As this is an
MDP, the agent only cares about the current state and
the future states. The agent can go in any direction
(up, left, or right), so it needs to decide where to
go for the optimal path. Here the agent will take a move
on a probability basis, changing the state. But
if we want some exact moves, then for this we need
to make some changes in terms of Q-values.
Q-Learning Explanation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
278
▪ Q represents the quality of the actions at each state.
▪ So instead of using a value at each state, we will use a pair of
state and action, i.e., Q(s, a).
▪ The Q-value specifies which action is better than the others, and
according to the best Q-value, the agent takes its next move.
The Bellman equation can be used for deriving the Q-value.
▪ To perform any action, the agent will get a reward R(s, a), and
it will also end up in a certain state, so the Q-value equation
will be:
▪ Hence, we can say that V(s) = max_a [Q(s, a)]
Q-Learning Explanation
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
279
▪ The Q stands for quality in Q-learning, which means it specifies the quality of
an action taken by the agent.
Q-table
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
280
▪ A Q-table or matrix is created while performing the Q-learning.
▪ The table follows the state and action pair, i.e., [s, a], and initializes the values
to zero.
▪ After each action, the table is updated, and the q-values are stored within the
table.
▪ The RL agent uses this Q-table as a reference table to select the best action
based on the q-values.
Difference Between Reinforcement Learning and Supervised Learning
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
281
Reinforcement Learning vs. Supervised Learning
▪ RL works by interacting with the environment; supervised learning works on an
existing dataset.
▪ The RL algorithm works like the human brain when making decisions; supervised
learning works like a human learning things under the supervision of a guide.
▪ In RL, no labeled dataset is present; in supervised learning, a labeled dataset is
present.
▪ In RL, no previous training is provided to the learning agent; in supervised learning,
training is provided to the algorithm so that it can predict the output.
▪ RL helps to take decisions sequentially; in supervised learning, a decision is made
when the input is given.
Reinforcement Learning
RJEs: Remote job entry points
▪ There are various applications based on the concept of RL.
Ref: [1]https://www.javatpoint.com/reinforcement-learning;
[2]https://medium.com/@yuxili/rl-applications-73ef685c07eb
[1] [2]
282
Gaussian Mixture Model (GMM)
283
Gaussian Mixture Model (GMM)
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
284
▪ k-means exploits only the mean of the cluster or distribution as the
representation for class-specific information.
▪ A second-order moment like the variance also contains class-specific
information.
▪ A Gaussian distribution can exploit both the mean and the variance.
▪ In the case of a scalar it is a univariate Gaussian distribution, and in the case of a
vector it is a multivariate Gaussian distribution.
Univariate vs Multivariate Gaussian Distribution
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
285
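▪ For reference, the two densities being compared are the standard univariate and multivariate Gaussian PDFs (notation as in Bishop):

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)

where μ is the mean, σ² the variance, and Σ the d × d covariance matrix.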
Univariate vs Multivariate Gaussian Distribution
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
286
Univariate vs Multivariate Gaussian Distribution
RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning
287
Clustering using Multivariate Gaussian Distribution
RJEs: Remote job entry points
288
Gaussian Mixture Model (GMM)
RJEs: Remote job entry points
289
Expectation-Maximization (EM) Algorithm
RJEs: Remote job entry points
290
Implementation of EM Algorithm
RJEs: Remote job entry points
291
Re-estimation in EM Algorithm
RJEs: Remote job entry points
292
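▪ The standard GMM re-estimation equations (following, e.g., Bishop, Ch. 9) consist of the E-step responsibilities and the M-step parameter updates:

\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \quad \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \quad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}, \quad \pi_k = \frac{N_k}{N}

The E-step and M-step are alternated until the log-likelihood converges.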
Clustering using GMM
RJEs: Remote job entry points
293
What is Gaussian Mixture Model (GMMs)?
RJEs: Remote job entry points
294
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a
mixture of a finite number of Gaussian distributions with unknown parameters. One can think of a mixture
model as a generalization of a k-means clustering algorithm, as it can be used for density estimation and
classification.
In a Gaussian mixture model, each cluster is associated with a multivariate Gaussian distribution, and the
mixture model is a weighted sum of these distributions. The weights indicate the probability that a data point
belongs to a particular cluster, and the Gaussian distributions describe the distribution of the data within each
cluster.
The parameters of a Gaussian mixture model can be estimated using the expectation-maximization (EM)
algorithm. This involves alternating between estimating the parameters of the Gaussian distributions and the
weights of the mixture model until convergence is reached.
Univariate vs Multivariate Gaussian Distribution
RJEs: Remote job entry points
295
https://www.shiksha.com/online-courses/articles/understanding-gaussian-mixture-models/
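A minimal sketch of the example code the description below refers to, reconstructed from that description (the exact means and random seed are assumptions):

import numpy as np
from sklearn.mixture import GaussianMixture

# 200 samples drawn from two 2-D Gaussian distributions with different means
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], size=(100, 2)),
               rng.normal(loc=[4, 4], size=(100, 2))])

# Fit a two-component GMM with full covariance matrices
gmm = GaussianMixture(n_components=2, covariance_type='full')
gmm.fit(X)
predictions = gmm.predict(X)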
The above example code generates a dataset X which
contains 200 samples drawn from two 2D Gaussian
distributions which have different means. The Gaussian
mixture model is then fit to the data, with n_components=2
indicating that there are two mixture components (i.e., two
clusters). The covariance_type parameter specifies the
type of covariance matrix to use for the Gaussian
distributions. In the above example, the covariance_type
value is ‘full’.
Once the model is fit, the predict method can be used
to predict the cluster labels for the data points in X. The
resulting cluster labels are stored in the predictions array.
Univariate vs Multivariate Gaussian Distribution
RJEs: Remote job entry points
296
To plot the data and the predicted cluster labels,
matplotlib is used, as follows:
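A minimal plotting sketch, assuming the X and predictions arrays from the code above:

import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=predictions)   # colour points by predicted cluster
plt.title('GMM cluster assignments')
plt.show()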
The output is a scatter plot of the data, with the
points coloured according to their predicted cluster
labels.
Real-Life Examples of Gaussian mixture models
RJEs: Remote job entry points
297
▪ Gaussian mixture models (GMMs), as already stated above, are statistical models that can be
used to represent the probability distribution of a multi-dimensional continuous variable as a
weighted sum of multiple multivariate normal distributions. GMMs are often used in a variety of
applications, including clustering, density estimation, and anomaly detection. Here are a few
examples of how GMMs could be used in real life:
▪ Clustering: GMMs can be used to identify patterns and group similar observations together. For
example, a GMM could be used to cluster customers into different segments based on their
purchase history and demographic data.
▪ Density estimation: GMMs can be used to estimate the probability density function (PDF) of a
given dataset. This can be useful for tasks such as density-based anomaly detection, where
GMMs can be used to identify observations that are significantly different from the rest of the
data.
▪ Anomaly detection: GMMs can be used to detect anomalous observations in a dataset. For
example, a GMM could be trained on normal network traffic data, and then used to identify
unusual traffic patterns that may indicate an intrusion attempt.
▪ Computer vision: GMMs can be used in computer vision applications to model the appearance
of objects in an image. For example, a GMM could be used to model the appearance of different
types of vehicles in a traffic surveillance system.
Advantages of Gaussian Mixture Models
RJEs: Remote job entry points
298
▪ Flexibility - Gaussian mixture models can model a wide range of probability
distributions, as they can approximate any distribution that can be represented as a weighted sum
of multiple normal distributions. Hence, they are very flexible in nature.
▪ Robustness - Gaussian mixture models are relatively robust to outliers present in the
data, as they can accommodate the presence of multiple modes, called "peaks", in the distribution.
▪ Speed - Gaussian mixture models are relatively fast to fit to a dataset, especially when using an
efficient optimization algorithm such as the expectation-maximization (EM) algorithm.
▪ Handling missing data - Gaussian mixture models have the ability to handle missing data by
marginalizing over the missing variables, which can be useful in situations where some observations are
incomplete.
▪ Interpretability - The parameters of a Gaussian mixture model (i.e., the weights, means, and
covariances of the components) have a clear interpretation, which can be useful for understanding
the underlying structure of the data.
Disadvantages of Gaussian Mixture Models
RJEs: Remote job entry points
299
•Sensitivity to initialization - Gaussian mixture models can be sensitive to the initial values of the model
parameters, especially when there are too many components in the mixture. This can sometimes lead to poor
convergence to the true maximum-likelihood solution.
•Assumption of normality - Gaussian mixture models assume that the data are generated from a mixture of
normal distributions, which may not always be the case in practice. If the data deviate significantly from
normality, GMMs may not be the most appropriate model.
•Number of components - Choosing the appropriate number of components in a Gaussian mixture model
can be challenging, as adding too many components may overfit the data, while using too few components
may underfit the data. Both extremes make model selection a difficult task.
•High-dimensional data - Gaussian mixture models can be computationally expensive to fit when working
with high-dimensional data, as the number of model parameters increases quadratically with the number of
dimensions.
•Limited expressive power - Gaussian mixture models can only represent distributions that can be
expressed as a weighted sum of normal distributions. This means that they may not be suitable for modelling
more complex distributions.
Hidden Markov Model in Machine Learning
Hidden Markov Model in Machine Learning
RJEs: Remote job entry points https://www.javatpoint.com/hidden-markov-model-in-machine-learning
▪ Hidden Markov Models (HMMs) are a type of probabilistic model that
are commonly used in machine learning for tasks such as
▪ Speech recognition
▪ Natural language processing
▪ Bioinformatics
▪ They are a popular choice for modelling sequences of data because
they can effectively capture the underlying structure of the data,
even when the data is noisy or incomplete.
What are Hidden Markov Models?
RJEs: Remote job entry points https://www.javatpoint.com/hidden-markov-model-in-machine-learning
▪ A Hidden Markov Model (HMM) is a probabilistic model that consists of a sequence of
hidden states, each of which generates an observation. The hidden states are usually not
directly observable, and the goal of HMM is to estimate the sequence of hidden states based on a
sequence of observations. An HMM is defined by the following components:
▪ A set of N hidden states, S = {s1, s2, ..., sN}.
▪ A set of M observations, O = {o1, o2, ..., oM}.
▪ An initial state probability distribution, π = {π1, π2, ..., πN}, which specifies the probability of
starting in each hidden state.
▪ A transition probability matrix, A = [aij], which defines the probability of moving from one hidden state
to another.
▪ An emission probability matrix, B = [bjk], which defines the probability of emitting an observation
from a given hidden state.
▪ The basic idea behind an HMM is that the hidden states generate the observations, and the
observed data is used to estimate the hidden state sequence; the forward-backward algorithm is
commonly used for this estimation.
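▪ A toy sketch of these components as NumPy arrays (the states, observation symbols, and probabilities below are illustrative assumptions):

import numpy as np

states = ['Rainy', 'Sunny']                  # hidden states S (N = 2)
observations = ['walk', 'shop', 'clean']     # observation symbols O (M = 3)
pi = np.array([0.6, 0.4])                    # initial state distribution π
A = np.array([[0.7, 0.3],                    # transition matrix A = [aij]
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],               # emission matrix B = [bjk]
              [0.6, 0.3, 0.1]])
# e.g., probability of starting Rainy and emitting 'shop': pi[0] * B[0, 1] = 0.24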
Applications of Hidden Markov Models
RJEs: Remote job entry points https://www.javatpoint.com/hidden-markov-model-in-machine-learning
▪ Speech Recognition
One of the most well-known applications of HMMs is speech recognition. In this field, HMMs are used to
model the different sounds and phones that make up speech. The hidden states correspond to the different
sounds or phones, and the observations are the acoustic signals generated by the speech. The goal is to
estimate the hidden state sequence, which corresponds to the transcription of the speech, based on the
observed acoustic signals. HMMs are particularly well-suited for speech recognition because they can
effectively capture the underlying structure of speech even when the data is noisy or incomplete. In speech
recognition systems, the HMMs are usually trained on large datasets of speech signals, and the estimated
parameters are used to transcribe speech in real time (see the Viterbi sketch below).
▪ Natural Language Processing
Another important application of HMMs is natural language processing. In this field, HMMs are used for tasks
such as part-of-speech tagging, named entity recognition, and text classification. In these applications,
the hidden states are typically associated with the underlying grammar or structure of the text, while the
observations are the words themselves. The goal is to estimate the hidden state sequence, which corresponds
to the structure or meaning of the text, based on the observed words. HMMs are useful in natural language
processing because they can effectively capture the underlying structure of the text even when the data is
noisy or ambiguous. In NLP systems, the HMMs are usually trained on large text corpora, and the estimated
parameters are used to perform these tagging and classification tasks.
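Decoding the most likely hidden state sequence, as in the speech transcription and part-of-speech tagging tasks above, is typically done with the Viterbi algorithm. Below is a minimal sketch in log space, reusing the toy π, A, and B values from the forward-algorithm example.

```python
# A minimal Viterbi sketch: most likely hidden state sequence.
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 2, 1]

# delta[j]: best log-probability of any state path ending in state j at time t
# psi[t][j]: the predecessor state on that best path (for backtracking)
logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
delta = logpi + logB[:, obs[0]]
psi = []
for o in obs[1:]:
    scores = delta[:, None] + logA          # scores[i, j]: path ending i -> j
    psi.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) + logB[:, o]

# Backtrack from the best final state to recover the full state sequence.
path = [int(delta.argmax())]
for back in reversed(psi):
    path.append(int(back[path[-1]]))
path.reverse()
print("Most likely state sequence:", path)
```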
Applications of Hidden Markov Models
RJEs: Remote job entry points https://www.javatpoint.com/hidden-markov-model-in-machine-learning
▪ Bioinformatics
HMMs are also widely used in bioinformatics, where they are used to model sequences of DNA, RNA, and
proteins. The hidden states, in this case, correspond to the different types of residues, while the observations
are the sequences of residues. The goal is to estimate the hidden state sequence, which corresponds to the
underlying structure of the molecule, based on the observed sequences of residues. HMMs are useful in
bioinformatics because they can effectively capture the underlying structure of the molecule, even when the
data is noisy or incomplete. In bioinformatics systems, the HMMs are usually trained on large datasets of
molecular sequences, and the estimated parameters of the HMMs are used to predict the structure or function
of new molecular sequences.
▪ Finance
Finally, HMMs have also been used in finance to model stock prices, interest rates, and
currency exchange rates. In these applications, the hidden states correspond to different economic states, such
as bull and bear markets, while the observations are the stock prices, interest rates, or exchange rates. The
goal is to estimate the hidden state sequence, which corresponds to the underlying economic state, based on
the observed prices, rates, or exchange rates. HMMs are useful in finance because they can effectively capture
the underlying economic state, even when the data is noisy or incomplete. In finance systems, the HMMs are
usually trained on large datasets of financial data, and the estimated parameters of the HMMs are used to
make predictions about future market trends or to develop investment strategies.
Limitations of Hidden Markov Models
RJEs: Remote job entry points https://www.javatpoint.com/hidden-markov-model-in-machine-learning
▪ Limited Modeling Capabilities
One of the key limitations of HMMs is that they are relatively limited in their modelling
capabilities. HMMs are designed to model sequences of data, where the underlying
structure of the data is represented by a set of hidden states. However, the structure of
the data can be quite complex, and the simple structure of HMMs may not be enough to
accurately capture all the details. For example, in speech recognition, the complex
relationship between the speech sounds and the corresponding acoustic signals may
not be fully captured by the simple structure of an HMM.
▪ Overfitting
Another limitation of HMMs is that they can be prone to overfitting, especially when the
number of hidden states is large or the amount of training data is limited. Overfitting
occurs when the model fits the training data too well and is unable to generalize to new
data. This can lead to poor performance when the model is applied to real-world data
and can result in high error rates. To avoid overfitting, it is important to carefully choose
the number of hidden states and to use appropriate regularization techniques.
Limitations of Hidden Markov Models
RJEs: Remote job entry points https://www.javatpoint.com/hidden-markov-model-in-machine-learning
▪ Lack of Robustness
HMMs are also limited in their robustness to noise and variability in the data. For example, in
speech recognition, the acoustic signals generated by speech can be subject to a variety of
distortions and noise, which can make it difficult for the HMM to accurately estimate the
underlying structure of the data. In some cases, these distortions and noise can cause the HMM
to make incorrect decisions, which can result in poor performance. To address these limitations,
it is often necessary to use additional processing and filtering techniques, such as noise
reduction and normalization, to pre-process the data before it is fed into the HMM.
▪ Computational Complexity
Finally, HMMs can also be limited by their computational complexity, especially when dealing
with large amounts of data or when using complex models. The computational complexity of
HMMs stems from the need to estimate the parameters of the model and to compute the likelihood
of the data given the model. This can be time-consuming and computationally expensive,
especially for large models or for data that is sampled at a high frequency. To address this
limitation, it is often necessary to use parallel computing techniques or to use approximations
that reduce the computational complexity of the model.
Naïve Bayes Classifier
RJEs: Remote job entry points
▪ Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem[1]
▪ It is mainly used in text classification that includes a high-dimensional training dataset[2]
▪ It is a probabilistic classifier, which means it predicts on the basis of the probability that an object belongs to a class
▪ Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the
occurrence of other features. Another strong assumption is that all features are equally important, i.e., they are
given the same weight.
▪ Bayes: It is based on Bayes’ Theorem so is called Bayes. Bayes’ Theorem finds the probability of an event
occurring given the probability of another event that has already occurred. It is mathematically given as:
P(A|B) = P(B|A) · P(A) / P(B)
where P(A|B) is the Posterior Probability, P(A) is the Prior Probability,
P(B|A) is the Likelihood Probability, and P(B) is the Marginal Probability
Ref: [1] https://www.geeksforgeeks.org/naive-bayes-classifiers/?ref=leftbar-rightbar
[2] https://www.javatpoint.com/machine-learning-naive-bayes-classifier
307
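As a quick numerical illustration of the theorem, the snippet below computes a posterior from an assumed prior, likelihood, and marginal; all three probability values are made up for illustration only.

```python
# A toy numerical check of Bayes' Theorem: suppose a disease has prior
# P(A) = 0.01, a test detects it with likelihood P(B|A) = 0.95, and the
# test comes back positive overall with P(B) = 0.06 (all values assumed).
p_A, p_B_given_A, p_B = 0.01, 0.95, 0.06

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A|B) = {p_A_given_B:.4f}")  # ~0.1583: positive test, still unlikely
```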
Naïve Bayes Classifier
RJEs: Remote job entry points
Ref: [1] https://www.geeksforgeeks.org/naive-bayes-classifiers/?ref=leftbar-rightbar
[2] https://www.tutorialspoint.com/machine_learning_with_python/classification_algorithms_naive_bayes.htm, [3] https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
▪ There are primarily three types of Naïve Bayes Classifiers:
▪ Gaussian Naïve Bayes – In Gaussian Naive Bayes, continuous values associated with each feature are
assumed to be distributed according to a Gaussian distribution. The likelihood of the features is assumed to
be Gaussian[1]
▪ Multinomial Naïve Bayes – Here, the features are assumed to be drawn from a multinomial distribution.
This kind of Naïve Bayes is most appropriate for features that represent discrete counts[2]
▪ Bernoulli Naïve Bayes – Here the features are assumed to be binary (0s and 1s). Text classification with ‘bag
of words’ model can be an application of Bernoulli Naïve Bayes
▪ The adjacent figure shows an example of a Naïve Bayes classifier estimating the probability of
play or no play based on likelihood estimation [3]
308
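Below is a minimal sketch of the Gaussian variant on a play/no-play style problem, assuming scikit-learn is available; the features (temperature, humidity) and labels are invented for illustration.

```python
# A minimal Gaussian Naive Bayes sketch on toy weather data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[30.0, 85.0], [27.0, 90.0], [21.0, 68.0],
              [18.0, 65.0], [24.0, 70.0], [29.0, 92.0]])
y = np.array([0, 0, 1, 1, 1, 0])      # 1 = play, 0 = no play

model = GaussianNB().fit(X, y)        # fits one Gaussian per feature per class
print(model.predict([[22.0, 72.0]]))          # predicted class
print(model.predict_proba([[22.0, 72.0]]))    # posterior probabilities
```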
Naïve Bayes Classifier
RJEs: Remote job entry points
▪ Advantages of Naïve Bayes Classifier
▪ A fast and easy ML algorithm for predicting the class of a dataset
▪ It can be used for Binary as well as Multi-class Classifications
▪ It is the most popular choice for text classification problems
▪ Disadvantage of Naïve Bayes Classifier
▪ Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship
between features
▪ Applications of Naïve Bayes Classifier
▪ It is used for Credit Scoring
▪ It is used in medical data classification
▪ It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner
▪ It is used in Text classification such as Spam filtering and Sentiment analysis
Ref: https://www.javatpoint.com/machine-learning-naive-bayes-classifier
309
Ensemble Classifiers
RJEs: Remote job entry points
▪ Ensemble learning helps improve machine learning results by combining several models[1]
▪ Better predictive performance compared to a single model
▪ Ensemble overcomes three problems:
▪ Statistical Problems: when the hypothesis space is too large for the amount of available data
▪ Computational Problems: when the learning algorithm cannot guarantee finding the best hypothesis
▪ Representational Problems: when the hypothesis space does not contain any good approximation of the target class(es)
▪ The main challenge with ensemble methods is to obtain base models which make different kinds of errors
▪ The three main classes of ensemble learning methods are bagging, stacking, and boosting[2]
▪ Bagging involves fitting many decision trees on different samples of the same dataset and averaging the predictions
▪ Stacking involves fitting many different model types on the same data and using another model to learn how to best
combine the predictions
▪ Boosting involves adding models sequentially, where each new model focuses on correcting the errors made by the
previous ones, and combining their weighted predictions
Ref: [1] https://www.geeksforgeeks.org/ensemble-classifier-data-mining/
[2] https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/
310
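To make the bagging idea concrete, here is a minimal sketch, assuming scikit-learn (version ≥ 1.2, where the base model is passed via the estimator parameter); the dataset is synthetic.

```python
# A minimal bagging sketch: many decision trees are fit on bootstrap
# samples of the same dataset and their predictions are combined.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 trees, each trained on a different bootstrap sample of the training set
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("Bagging test accuracy:", bag.score(X_te, y_te))
```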

More Related Content

PPTX
Introduction to Machine Learning and Deep Learning
PPTX
TensorFlow Event presentation08-12-2024.pptx
PPTX
L15.pptx
PPTX
马赛PPT - DL & ML.pptx
PDF
General introduction to AI ML DL DS
PPTX
ML-Chapter_one.pptx
PPTX
JavaScript and Artificial Intelligence by Aatman & Sagar - AhmedabadJS
PPTX
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...
Introduction to Machine Learning and Deep Learning
TensorFlow Event presentation08-12-2024.pptx
L15.pptx
马赛PPT - DL & ML.pptx
General introduction to AI ML DL DS
ML-Chapter_one.pptx
JavaScript and Artificial Intelligence by Aatman & Sagar - AhmedabadJS
Recent Advances in Machine Learning: Bringing a New Level of Intelligence to ...

Similar to Introduction to Machine Learning (ML) Final - Copy.pdf (20)

PPTX
Machine learning
PDF
AI/ML Fundamentals to advanced Slides by GDG Amrita Mysuru.pdf
PDF
Machine Learning for Dummies (without mathematics)
PDF
Machine learning-for-dummies-andrews-sobral-activeeon
PDF
Python Machine Learning - Getting Started
PPTX
Session_2_Introduction_to_Deep_Learning.pptx
PDF
Automating Reverse Engineering: Function Classification and Matching
PDF
How to use Artificial Intelligence with Python? Edureka
PDF
Introduction to machine learning
PPTX
Deep learning with tensorflow
PDF
Dato Keynote
PPTX
Machine Learning Basics
PDF
Summary machine learning and model deployment
PPTX
Reinforcement Learning, Application and Q-Learning
PPTX
Is Spark the right choice for data analysis ?
PPT
Machine learning
PDF
Understanding and Protecting Artificial Intelligence Technology (Machine Lear...
PDF
machine learning basic unit1 for third year cse studnets
PDF
Persian MNIST in 5 Minutes
PPTX
Introduction to Machine Learning_ UNIT 1
Machine learning
AI/ML Fundamentals to advanced Slides by GDG Amrita Mysuru.pdf
Machine Learning for Dummies (without mathematics)
Machine learning-for-dummies-andrews-sobral-activeeon
Python Machine Learning - Getting Started
Session_2_Introduction_to_Deep_Learning.pptx
Automating Reverse Engineering: Function Classification and Matching
How to use Artificial Intelligence with Python? Edureka
Introduction to machine learning
Deep learning with tensorflow
Dato Keynote
Machine Learning Basics
Summary machine learning and model deployment
Reinforcement Learning, Application and Q-Learning
Is Spark the right choice for data analysis ?
Machine learning
Understanding and Protecting Artificial Intelligence Technology (Machine Lear...
machine learning basic unit1 for third year cse studnets
Persian MNIST in 5 Minutes
Introduction to Machine Learning_ UNIT 1
Ad

More from Dr. Rahul Pandya (20)

PDF
Quantitative, Qualitative, and Mixed Method - E1.pdf
PDF
Types of Licenses in Publication and Literature.pdf
PDF
Quantitative, Qualitative, and Mixed Methods for Research.pdf
PDF
Publication Performance Metrics: Journal Indexing, Quartiles, and Altrimatrix
PDF
Data Analysis Methods and Techniques with Comprehensive Details
PPTX
Digital Communication and Coding Theory.pptx
PPTX
Writing Research Grant Proposals : Project Proposals
PDF
Writing Review Articles? | Prof. Rahul Pandya (IIT Dharwad)
PDF
Dr. Rahul Pandya ECE Gate Course Communications Original.pdf
PPTX
Everything on Plagiarism | What is Plagiarism?
PPTX
Stochastic Process and its Applications.
PPTX
Computer Networks | Communication Networks
PPTX
Dr Rahul Pandya 6G Vision, Potential technologies, and Challenges - Animated ...
PDF
Introduction to Probability Theory
PDF
How to Cite Sources in PPT.pdf
PDF
Verbatim Plagiarism | Direct Plagiarism | Direct Copy Paste | Types of Plagia...
PDF
Paraphrasing without citing the souces.pdf
PDF
Avoid Plagiarism - Dr. Rahul Pandya.pdf
PDF
Journal Papers vs. Conference Papers - Dr. Rahul Pandya
PDF
Research Paper Writing - Dr. Rahul Pandya
Quantitative, Qualitative, and Mixed Method - E1.pdf
Types of Licenses in Publication and Literature.pdf
Quantitative, Qualitative, and Mixed Methods for Research.pdf
Publication Performance Metrics: Journal Indexing, Quartiles, and Altrimatrix
Data Analysis Methods and Techniques with Comprehensive Details
Digital Communication and Coding Theory.pptx
Writing Research Grant Proposals : Project Proposals
Writing Review Articles? | Prof. Rahul Pandya (IIT Dharwad)
Dr. Rahul Pandya ECE Gate Course Communications Original.pdf
Everything on Plagiarism | What is Plagiarism?
Stochastic Process and its Applications.
Computer Networks | Communication Networks
Dr Rahul Pandya 6G Vision, Potential technologies, and Challenges - Animated ...
Introduction to Probability Theory
How to Cite Sources in PPT.pdf
Verbatim Plagiarism | Direct Plagiarism | Direct Copy Paste | Types of Plagia...
Paraphrasing without citing the souces.pdf
Avoid Plagiarism - Dr. Rahul Pandya.pdf
Journal Papers vs. Conference Papers - Dr. Rahul Pandya
Research Paper Writing - Dr. Rahul Pandya
Ad

Recently uploaded (20)

PPTX
Fundamentals of Mechanical Engineering.pptx
PPTX
Artificial Intelligence
PPTX
Sorting and Hashing in Data Structures with Algorithms, Techniques, Implement...
PDF
Design Guidelines and solutions for Plastics parts
PDF
Abrasive, erosive and cavitation wear.pdf
PPTX
communication and presentation skills 01
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PDF
Accra-Kumasi Expressway - Prefeasibility Report Volume 1 of 7.11.2018.pdf
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PPTX
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
PPTX
Nature of X-rays, X- Ray Equipment, Fluoroscopy
PPTX
Module 8- Technological and Communication Skills.pptx
PPT
Occupational Health and Safety Management System
PDF
distributed database system" (DDBS) is often used to refer to both the distri...
PPTX
Feature types and data preprocessing steps
PDF
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
PDF
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
PDF
III.4.1.2_The_Space_Environment.p pdffdf
PPTX
Management Information system : MIS-e-Business Systems.pptx
PPTX
"Array and Linked List in Data Structures with Types, Operations, Implementat...
Fundamentals of Mechanical Engineering.pptx
Artificial Intelligence
Sorting and Hashing in Data Structures with Algorithms, Techniques, Implement...
Design Guidelines and solutions for Plastics parts
Abrasive, erosive and cavitation wear.pdf
communication and presentation skills 01
Fundamentals of safety and accident prevention -final (1).pptx
Accra-Kumasi Expressway - Prefeasibility Report Volume 1 of 7.11.2018.pdf
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
Nature of X-rays, X- Ray Equipment, Fluoroscopy
Module 8- Technological and Communication Skills.pptx
Occupational Health and Safety Management System
distributed database system" (DDBS) is often used to refer to both the distri...
Feature types and data preprocessing steps
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
III.4.1.2_The_Space_Environment.p pdffdf
Management Information system : MIS-e-Business Systems.pptx
"Array and Linked List in Data Structures with Types, Operations, Implementat...

Introduction to Machine Learning (ML) Final - Copy.pdf

  • 1. Dr. Rahul J. Pandya, Assistant Professor, Electrical, Electronics, and Communication Engineering (EECE) Dept., Indian Institute of Technology (IIT), Dharwad Email: rpandya@iitdh.ac.in 1 Introduction to Machine Learning
  • 2. Course content - Syllabus RJEs: Remote job entry points ▪ Introduction to Machine Learning (ML) ▪ Types of Machine learning ▪ Supervised ML ▪ Unsupervised ML ▪ Semi-Supervised ML ▪ Reinforcement Learning (RL) ▪ Machine learning (ML) algorithms ▪ Regression- Linear Regression, Logistic Regression, Multivariate Regression ▪ Classification ▪ Clustering – Partitional clustering, Hierarchical clustering, Density based clustering ▪ Decision trees ▪ K-Nearest Neighbours (KNN) ▪ Kernel methods: Support vector machine ▪ Reinforcement Learning (RL) algorithms ▪ Graphical models: Gaussian mixture models and hidden Markov models ▪ Introduction to Bayesian Approach: Bayesian classification, Bayesian learning, Bayes optimal classifier, and Naïve Bayes Classifier. 2
  • 3. Reference books RJEs: Remote job entry points ▪ C. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2006 ▪ K. P. Murphy, “Machine Learning: A Probability Perspective”, MIT Press, 2012. 3
  • 5. *` Artificial Intelligence (AI) Enables systems to perform intelligent tasks through a set of rules https://www.geeksforgeeks.org/difference-between-artificial-intelligence- vs-machine-learning-vs-deep-learning/ 5
  • 6. *` Artificial Intelligence (AI) Enables systems to perform intelligent tasks through a set of rules Machine Learning (ML) It is a process of learning from the data without using complex rules. It involves training a model from datasets and predicting the outcome. https://www.geeksforgeeks.org/difference-between-artificial-intelligence- vs-machine-learning-vs-deep-learning/ 6
  • 7. *` Artificial Intelligence (AI) Enables systems to perform intelligent tasks through a set of rules Machine Learning (ML) It is a process of learning from the data without using complex rules. It involves training a model from datasets and predicting the outcome. Deep Learning (DL) ML at a large-scale, Equipped with artificial neural networks https://www.geeksforgeeks.org/difference-between-artificial-intelligence- vs-machine-learning-vs-deep-learning/ 7
  • 8. Introduction to Machine Learning (ML) RJEs: Remote job entry points ▪ Artificial Intelligence (AI): Approaches that enable computers to perform intelligent tasks. ▪ Machine Learning (ML): Approaches that learn the underlying pattern in given set of features without being explicitly programmed. ▪ Deep Learning (DL): Approaches that learn the underlying representations and patterns in given set of raw data without being explicitly programmed. 8
  • 9. Artificial Intelligence RJEs: Remote job entry points ▪ Intelligence: Experiencing (ability to learn & understand) and use it for deciding future course ▪ Artificial Intelligence (AI): Enabling machines to do so called intelligent tasks ▪ Problem solving ▪ Discovery ▪ Learning ▪ Dealing with uncertainties ▪ AI Categories: ▪ Problem solving using search methods ▪ State space search, heuristic search, randomized search, rule based, ▪ Symbolic manipulation is one form of AI ▪ Connectionist approach is another form of AI ▪ S R Mahadeva Prasanna PRML August 9 9
  • 10. Machine Learning RJEs: Remote job entry points ▪ With more and more digital data available, task of automatic discovery and learning of patterns, both natural and synthetic data. ▪ Not much focus on feature extraction, signal processing knowledge not pre-requisite ! ▪ More emphasis on discovery and learning of patterns by machine. ▪ Ability to learn by extracting patterns from data (features) ▪ Treated pattern learning more like associated function learning. ▪ Output y = f (x), where y is output and x is input data (features). ▪ Goal of ML is to learn f () that maps x to y. Ref: https://www.javatpoint.com/reinforcement-learning 10
  • 11. Deep Learning RJEs: Remote job entry points ▪ Task of learning both features (representation) and also patterns for pattern recognition. ▪ Trying to mimic human way of learning. ▪ Learning from experience ▪ Need not specify everything in the beginning ▪ Understand in terms of hierarchy of concepts ▪ Each concept defined in terms its relation to simpler concepts ▪ Learning complicated concepts out of simpler ones ▪ S Ref: https://www.javatpoint.com/reinforcement-learning 11
  • 13. Introduction to Machine Learning (ML) RJEs: Remote job entry points ▪ Machin learning gives “ computers the ability to learn without being explicitly programmed.” ~ Arthur Samuel Ref: https://pub.towardsai.net/machine-learning-algorithms-for-beginners-with-python-code-examples-ml-19c6afd60daa https://www.google.com/imgres?imgurl=https%3A%2F%2Fprutor.ai%2Fwp-content%2Fuploads%2FML-vs-Programming.png&tbnid= -https://prutor.ai/ml-what-is-machine-learning/ ▪ Preparing for the exams ▪ Students feed their machine (brain) with a good amount of high-quality data (questions and answers from different books or teachers notes or online video lectures). ▪ Training their brain with input as well as output i.e. what kind of approach or logic do they have to solve a different kind of questions. 13
  • 14. Introduction to Machine Learning (ML) RJEs: Remote job entry points ▪ Machin learning gives “ computers the ability to learn without being explicitly programmed.” ~ Arthur Samuel Ref: https://pub.towardsai.net/machine-learning-algorithms-for-beginners-with-python-code-examples-ml-19c6afd60daa https://www.google.com/imgres?imgurl=https%3A%2F%2Fprutor.ai%2Fwp-content%2Fuploads%2FML-vs-Programming.png&tbnid= -https://prutor.ai/ml-what-is-machine-learning/ ▪ Preparing for the exams ▪ Similarly, in ML train machine with data (both inputs and outputs are given to model) and when the time comes test on data (with input only) and achieves our model scores by comparing its answer with the actual output which has not been fed while training. 14
  • 15. How ML works? RJEs: Remote job entry points ▪ Features of ML: ▪ Machine learning uses data to detect various patterns in a given dataset. ▪ It can learn from past data and improve automatically. ▪ It is a datas-driven technology. ▪ Machine learning is like data mining as it also deals with vast data. Ref: https://www.javatpoint.com/machine-learning ▪ Machine Learning system learns from historical data, builds the prediction models, and whenever it receives new data, predicts the output for it 15
  • 16. Why Machine Learning (ML)? RJEs: Remote job entry points ▪ Machin learning gives “ computers the ability to learn without being explicitly programmed.” ~ Arthur Samuel ▪ Why ML? ▪ Machine learning models help us in many tasks, such as: ▪ Object Recognition ▪ Summarization ▪ Prediction ▪ Classification ▪ Clustering ▪ Recommender systems ▪ And others ▪ ML refers to the scientific branch of AI ▪ Deep learning is a subset of ML Ref: https://pub.towardsai.net/machine-learning-algorithms-for-beginners-with-python-code-examples-ml-19c6afd60daa https://www.google.com/imgres?imgurl=https%3A%2F%2Fprutor.ai%2Fwp-content%2Fuploads%2FML-vs-Programming.png&tbnid= - ETheD8sGlw9TM&vet=12ahUKEwj4k9OO7NOAAxXc5TgGHQy1CN8QMygHegUIARDTAQ..i&imgrefurl=https%3A%2F%2Fprutor.ai%2Fml-what-is-machine-learning%2F&docid=-yk7- zimRN69qM&w=571&h=223&q=What%20is%20Machine%20Learning%20(ML)%3F&ved=2ahUKEwj4k9OO7NOAAxXc5TgGHQy1CN8QMygHegUIARDTAQ 16
  • 17. Basic Difference in ML and Traditional Programming? RJEs: Remote job entry points https://prutor.ai/ml-what-is-machine-learning/ ▪ What does exactly learning means for a computer? ▪ Learning from Experiences with respect to some class of Tasks, if its performance in a given Task improves with the Experience. ▪ Learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E ▪ Traditional Programming: We feed in DATA (Input) + PROGRAM (logic), run it on machine and get output. ▪ Machine Learning: We feed in DATA(Input) + Output, run it on machine during training and the machine creates its own program(logic), which can be evaluated while testing. 17
  • 18. Machine Learning in Current world RJEs: Remote job entry points https://www.javatpoint.com/applications-of-machine-learning 18
  • 19. Traditional ML vs DL RJEs: Remote job entry points https://www.researchgate.net/figure/Comparison-between-ML-and-Dl-algorithm_fig5_344628869 19
  • 20. Neural Nets vs Deep Learning RJEs: Remote job entry points [Taken from public domain. Original authors highly acknowledged.] ▪ The concept of deep learning first originated from neural network. ▪ A good example of deep neural network is a feed forward neural network (FFNN). ▪ Backpropagation (BP) is the workhorse algorithm for learning the parameters of FFNN. ▪ BP did not work well for networks having more than a small number of hidden layers. ▪ Insufficient data leading to overfitting and difficulty in training of the deep networks was the main limitation. 20
  • 21. NN vs DL RJEs: Remote job entry points https://yangxiaozhou.github.io/data/2020/09/24/intro-to-cnn.html 21
  • 22. AI, ML, NN and DL RJEs: Remote job entry points https://www.researchgate.net/figure/Relationship-between-artificial-intelligence-machine-learning-deep-learning-and_fig2_351110482 22
  • 23. AI, ML, NN and DL RJEs: Remote job entry points https://www.researchgate.net/figure/Relationship-between-artificial-intelligence-machine-learning-deep-learning-and_fig2_351110482 23
  • 24. Information Extraction & Modeling RJEs: Remote job entry points [Taken from public domain. Original authors highly acknowledged.] ▪ Information : Knowledge about something. Face, speaker, route. ▪ Extraction : Extract physical quantities that carry information. ▪ Feature extraction or representation learning ▪ Modeling : Invariant entity that carries the knowledge. From features model these invariant entities. ▪ Process : In human computer interaction it refers to signal processing, pattern recognition, machine learning and deep learning. 24
  • 25. When Machine Learning and when Deep Learning RJEs: Remote job entry points [Taken from public domain. Original authors highly acknowledged.] ▪ Problem statement well / ill defined ▪ Amount of data less / too much ▪ Domain knowledge is high / low ▪ Well meaning feature extraction possible / not possible ▪ Machine learning / deep learning 25
  • 26. Classification of Machine Learning RJEs: Remote job entry points ▪ At a broad level, machine learning can be classified into three types: https://www.javatpoint.com/machine-learning ▪ Supervised ML models ▪ Unsupervised ML models ▪ Semi-supervised ML models (combination of Supervised and Unsupervised models) ▪ Reinforcement learning models 26
  • 27. Regression RJEs: Remote job entry points [Taken from public sources. Original authors acknowledged.] ▪ Objective of regression task. ▪ Univariate vs multivariate regression. ▪ Linear vs nonlinear regression. ▪ Cost function. ▪ Gradient descent method of optimization. ▪ Normal equation approach for parameter estimation. ▪ Logistic regression 28
  • 28. Clustering RJEs: Remote job entry points [Taken from public sources. Original authors acknowledged.] ▪ Objective of clustering task. ▪ Partitioning approach - k-means, fuzzy-c means. ▪ Model based approach - Gaussian mixture model (GMM). ▪ Expectation-maximization (EM) algorithm. ▪ Hierarchical clustering. ▪ Hierarchical - agglomerative clustering. ▪ Hierarchical - divisive clustering.S 29
  • 29. Classification RJEs: Remote job entry points [Taken from public sources. Original authors acknowledged.] ▪ Objective of classification task. ▪ Binary vs multiclass classification. ▪ Generative vs discriminative classification. ▪ Parametric vs nonparametric classification. ▪ Logistic regression. ▪ k-nearest neighbour classification. ▪ Support vector machine. ▪ Generative classifiers. 30
  • 30. Dimensionality Reduction RJEs: Remote job entry points [Taken from public sources. Original authors acknowledged.] ▪ Objective of dimensionality reduction task. ▪ Principal component analysis (PCA). ▪ Linear discriminant analysis (LDA). ▪ PCA based dimensionality reduction. ▪ PCA based classification. ▪ LDA based dimensionality reduction. ▪ LDA based classification. 31
  • 31. Time Series Modelling RJEs: Remote job entry points [Taken from public sources. Original authors acknowledged.] ▪ Objective of time series modelling task. ▪ Markov process and models. ▪ Observable vs hidden Markov model ▪ Hidden Markov Model (HMM). ▪ Training and testing of HMM ▪ Forward and backward variables. ▪ Viterbi algorithm for optimal state sequence. ▪ Expectation maximization (EM) approach for training. 32
  • 32. Bayesian Approach RJEs: Remote job entry points [Taken from public sources. Original authors acknowledged.] ▪ Objective of Bayesian approach. ▪ Probabilistic framework for classification. ▪ Bayesian classification. ▪ Bayesian learning. ▪ Maximum a posteriori (MAP) approach. ▪ Bayes optimal classifier. ▪ Gibbs sampling. ▪ Naive Bayes classifier. ▪ Bayesian network. 33
  • 34. Classification of Machine Learning RJEs: Remote job entry points ▪ At a broad level, machine learning can be classified into three types: https://www.javatpoint.com/machine-learning ▪ Supervised ML models ▪ Unsupervised ML models ▪ Semi-supervised ML models (combination of Supervised and Unsupervised models) ▪ Reinforcement learning models 35
  • 36. Supervised Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Supervised learning is a type of machine learning in which machines are trained using well "labelled" training data, and on basis of that data, machines predict the output. ▪ The labelled data means some input data is already tagged with the correct output. ▪ In supervised learning, the training data provided to the machines work as the supervisor that teaches the machines to predict the output correctly. ▪ It applies the same concept as a student learns in the supervision of the teacher. ▪ Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function to map the input variable(x) with the output variable(y). 37
  • 37. How Supervised Learning Works? RJEs: Remote job entry points ▪ In supervised learning, models are trained using labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a subset of the training set), and then it predicts the output. Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle, and Polygon. Now the first step is that we need to train the model for each shape. ▪ If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square. ▪ If the given shape has three sides, then it will be labelled as a triangle. ▪ If the given shape has six equal sides then it will be labelled as hexagon. ▪ Now, after training, we test our model using the test set, and the task of the model is to identify the shape. 38
  • 38. How Supervised Learning Works? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud Detection, spam filtering, etc. ▪ Algorithms like Decision tree, Random Forest, KNN, Logistic Regression, etc. fall under supervised ML models 39
  • 39. Steps Involved in Supervised Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •First Determine the type of training dataset •Collect/Gather the labelled training data. •Split the training dataset into training dataset, test dataset, and validation dataset. •Determine the input features of the training dataset, which should have enough knowledge so that the model can accurately predict the output. •Determine the suitable algorithm for the model, such as support vector machine, decision tree, etc. •Execute the algorithm on the training dataset. •Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, which means our model is accurate. 40
  • 40. Types of supervised Machine Learning Algorithms RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Regression • Regression algorithms are used if there is a relationship between the input variable and the output variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market Trends, etc. Below are some popular Regression algorithms which come under supervised learning: •Linear Regression •Regression Trees •Non-Linear Regression •Bayesian Linear Regression •Polynomial Regression Classification • Classification algorithms are used when the output variable is categorical, which means there are two classes such as Yes-No, Male- Female, True-false, etc. • Spam Filtering, • Random Forest • Decision Trees • Logistic Regression • Support Vector Machines 41
  • 41. Advantages/Disadvantages of Supervised learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Advantages of supervised learning: •With the help of supervised learning, the model can predict the output on the basis of prior experiences. •In supervised learning, we can have an exact idea about the classes of objects. •Supervised learning model helps us to solve various real-world problems such as fraud detection, spam filtering, etc. Disadvantages of supervised learning: •Supervised learning models are not suitable for handling the complex tasks. •Supervised learning cannot predict the correct output if the test data is different from the training dataset. •Training required lots of computation times. •In supervised learning, we need enough knowledge about the classes of object. 42
  • 42. Unsupervised Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Supervised machine learning in which models are trained using labeled data under the supervision of training data. But there may be many cases in which we do not have labeled data and need to find the hidden patterns from the given dataset. So, to solve such types of cases in machine learning, we need unsupervised learning techniques. ▪ What is Unsupervised Learning? ▪ Unsupervised learning is a machine learning technique in which models are not supervised using training dataset. Instead, models itself find the hidden patterns and insights from the given data. It can be compared to learning which takes place in the human brain while learning new things. It can be defined as: ▪ Unsupervised learning is a type of machine learning in which models are trained using unlabeled dataset and are allowed to act on that data without any supervision. ▪ Unsupervised learning cannot be directly applied to a regression or classification problem because unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of dataset, group that data according to similarities, and represent that dataset in a compressed format. 43
  • 43. Unsupervised Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. ▪ The algorithm is never trained upon the given dataset, which means it does not have any idea about the features of the dataset. ▪ The task of the unsupervised learning algorithm is to identify the image features on their own. ▪ Unsupervised learning algorithm will perform this task by clustering the image dataset into the groups according to similarities between images. 44
  • 44. Why use Unsupervised Learning? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •Unsupervised learning is helpful for finding useful insights from the data. •Unsupervised learning is much similar as a human learns to think by their own experiences, which makes it closer to the real AI. •Unsupervised learning works on unlabelled and uncategorized data which make unsupervised learning more important. •In real-world, we do not always have input data with the corresponding output so to solve such cases, we need unsupervised learning. 45
  • 45. Working of Unsupervised Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Here, we have taken an unlabeled input data, which means it is not categorized and corresponding outputs are also not given. ▪ Now, this unlabeled input data is fed to the machine learning model in order to train it. Firstly, it will interpret the raw data to find the hidden patterns from the data and then will apply suitable algorithms such as k-means clustering, Decision tree, etc. ▪ Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to the similarities and difference between the objects. 46
  • 46. Types of Unsupervised Learning Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •Clustering: •Clustering is a method of grouping the objects into clusters such that objects with most similarities remains into a group and has less or no similarities with the objects of another group. •Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities. •Association: •An association rule is an unsupervised learning method which is used for finding the relationships between variables in the large database. •It determines the set of items that occurs together in the dataset. • Association rule makes marketing strategy more effective. •Such as people who buy X item (suppose a bread) are also tend to purchase Y (Butter/Jam) item. •A typical example of Association rule is Market Basket Analysis. 47
  • 47. Unsupervised Learning algorithms RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •K-means clustering •KNN (k-nearest neighbors) •Hierarchal clustering •Anomaly detection •Neural Networks •Principle Component Analysis •Independent Component Analysis •Apriori algorithm •Singular value decomposition 48
  • 48. Advantages/ Disadvantages of Unsupervised Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •Advantages •Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labelled input data. •Unsupervised learning is preferable as it is easy to get unlabelled data in comparison to labelled data. •Disadvantages •Unsupervised learning is intrinsically more difficult than supervised learning as it does not have corresponding output. •The result of the unsupervised learning algorithm might be less accurate as input data is not labelled, and algorithms do not know the exact output in advance 49
  • 49. Difference between Supervised and Unsupervised Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning 50
  • 50. Difference between Supervised and Unsupervised Learning RJEs: Remote job entry points Supervised Learning Unsupervised Learning ▪ Supervised learning algorithms are trained using labeled data. ▪ Unsupervised learning algorithms are trained using unlabeled data. ▪ Supervised learning model takes direct feedback to check if it is predicting correct output or not. ▪ Unsupervised learning model does not take any feedback. ▪ Supervised learning model predicts the output. ▪ Unsupervised learning model finds the hidden patterns in data. ▪ In supervised learning, input data is provided to the model along with the output. ▪ In unsupervised learning, only input data is provided to the model. ▪ The goal of supervised learning is to train the model so that it can predict the output when it is given new data. ▪ The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset. ▪ Supervised learning needs supervision to train the model. ▪ Unsupervised learning does not need any supervision to train the model. ▪ Supervised learning can be categorized in Classification and Regression problems. ▪ Unsupervised Learning can be classified in Clustering and Associations problems. ▪ Supervised learning can be used for those cases where we know the input as well as corresponding outputs. ▪ Unsupervised learning can be used for those cases where we have only input data and no corresponding output data. ▪ Supervised learning model produces an accurate result. ▪ Unsupervised learning model may give less accurate result as compared to supervised learning. ▪ Supervised learning is not close to true Artificial intelligence as in this, we first train the model for each data, and then only it can predict the correct output. ▪ Unsupervised learning is more close to the true Artificial Intelligence as it learns similarly as a child learns daily routine things by his experiences. ▪ It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class ▪ It includes various algorithms such as Clustering, KNN, and Apriori algorithm. Ref: https://www.javatpoint.com/supervised-machine-learning 51
  • 52. Regression Analysis in Machine learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Regression analysis is a statistical method to model the relationship between a dependent (target) and independent (predictor) variables with one or more independent variables. More specifically, Regression analysis helps us to understand how the value of the dependent variable is changing corresponding to an independent variable when other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc. ▪ Example: Suppose there is a marketing company A, who does various advertisement every year and get sales on that. The list shows the advertisement made by the company in the last 5 years and the corresponding sales: 53
  • 53. Regression Analysis in Machine learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Now, the company wants to do the advertisement of $200 in the year and wants to know the prediction about the sales for this year. So to solve such type of prediction problems in machine learning, we need regression analysis. 54
  • 54. Regression Analysis in Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict the continuous output variable based on the one or more predictor variables. It is mainly used for prediction, forecasting, time series modelling, and determining the causal-effect relationship between variables. ▪ In Regression, we plot a graph between the variables which best fits the given datapoints, using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the datapoints on target- predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between datapoints and line tells whether a model has captured a strong relationship or not. 55
  • 55. Regression Analysis in Machine learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Some examples of regression can be as: •Prediction of rain using temperature and other factors •Determining Market trends •Prediction of road accidents due to rash driving. 56
  • 56. Why do we use Regression Analysis? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Regression analysis helps in the prediction of a continuous variable. There are various scenarios in the real world where we need some future predictions such as weather condition, sales prediction, marketing trends, etc., for such case we need some technology which can make predictions more accurately. So for such case we need Regression analysis which is a statistical method and used in machine learning and data science. ▪ Regression estimates the relationship between the target and the independent variable. ▪ It is used to find the trends in data. ▪ By performing the regression, we can confidently determine the most important factor, the least important factor, and how each factor is affecting the other factors. 59
  • 57. Types of Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning 60
  • 59. Linear Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Linear regression is a statistical regression method which is used for predictive analysis. ▪ Shows the relationship between the continuous variables. ▪ It is used for solving the regression problem in machine learning. ▪ Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence called linear regression. Y= aX+b Here, Y = dependent variables (target variables), X= Independent variables (predictor variables), a and b are the linear coefficients 62
  • 60. Linear Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ If there is only one input variable (x), then such linear regression is called simple linear regression. And if there is more than one input variables, then such linear regression is called multiple linear regression. ▪ The relationship between variables in the linear regression model can be explained using the image. Here we are predicting the salary of an employee on the basis of the year of experience. Here, Y = dependent variables (target variables), X= Independent variables (predictor variables), a and b are the linear coefficients Y= aX+b 63
  • 61. Linear Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc. ▪ Linear regression algorithm shows a linear relationship between a dependent (y) and one or more independent (y) variables, hence called as linear regression. Since linear regression shows the linear relationship, which means it finds how the value of the dependent variable is changing according to the value of the independent variable. ▪ The linear regression model provides a sloped straight line representing the relationship between the variables. 64
  • 62. Linear Regression in Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning y= a0+a1x+ ε Here, ▪ Y= Dependent Variable (Target Variable) ▪ X= Independent Variable (Predictor Variable) ▪ a0= Intercept of the line (Gives an additional degree of freedom) ▪ a1 = Linear regression coefficient (scale factor to each input value). ▪ ε = random error ▪ The values for x and y variables are training datasets for Linear Regression model representation. 65
  • 63. Types of Linear Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression. ▪ Multiple Linear regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression. 66
  • 64. Finding the best fit line RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ When working with linear regression, our main goal is to find the best fit line that means the error between predicted values and actual values should be minimized. The best fit line will have the least error. ▪ The different values for weights or the coefficient of lines (a0, a1) gives a different line of regression, so we need to calculate the best values for a0 and a1 to find the best fit line, so to calculate this we use cost function. y= a0+a1x+ ε 67
  • 65. Cost function RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ The different values for weights or coefficients of lines (a0, a1) give the different line of regression, and the cost function is used to estimate the values of the coefficients for the best fit line. ▪ Cost function optimizes the regression coefficients or weights. It measures how a linear regression model is performing. ▪ We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as Hypothesis function. y= a0+a1x+ ε 68
  • 66. Cost function RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Where, N=Total number of observations Yi = Actual value (a1xi+a0)= Predicted value Residuals: The distance between the actual value and predicted values is called residual. If the observed points are far from the regression line, then the residual will be high, and so cost function will be high. If the scatter points are close to the regression line, then the residual will be small and hence the cost function. ▪ For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of squared error occurred between the predicted values and actual values. ▪ For the above linear equation, MSE can be calculated as: 69
  • 68. Logistic Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Logistic regression is another supervised learning algorithm which is used to solve the classification problems. In classification problems, we have dependent variables in a binary or discrete format such as 0 or 1. ▪ Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No, True or False, Spam or not spam, etc. ▪ It is a predictive analysis algorithm which works on the concept of probability. ▪ Logistic regression uses sigmoid function or logistic function which is a complex cost function. This sigmoid function is used to model the data in logistic regression. 71
  • 69. Logistic Regression RJEs: Remote job entry points https://mathworld.wolfram.com/SigmoidFunction.html ▪ f(x) = Output between the 0 and 1 value ▪ x = input to the function ▪ e = base of natural logarithm ▪ There are three types of logistic regression: •Binary (0/1, pass/fail) •Multi (cats, dogs, lions) •Ordinal (low, medium, high) 73
  • 71. Polynomial Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •Polynomial Regression is a type of regression which models the non-linear dataset using a linear model. •It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and corresponding conditional values of y. •Suppose there is a dataset which consists of datapoints which are present in a non- linear fashion, so for such case, linear regression will not best fit to those datapoints. To cover such datapoints, we need Polynomial regression. •In Polynomial regression, the original features are transformed into polynomial features of given degree and then modelled using a linear model. Which means the data-points are best fitted using a polynomial line. 75
  • 72. Polynomial Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •The equation for polynomial regression also derived from linear regression equation that means Linear regression equation Y=b0+b1x, is transformed into Polynomial regression equation Y= b0+b1x+ b2x2+ b3x3+.....+ bnxn •Here Y is the predicted/target output, b0, b1,... bn are the regression coefficients. x is our independent/input variable. •The model is still linear as the coefficients are still linear with quadratic 76
  • 74. Support Vector Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Support Vector Machine is a supervised learning algorithm which can be used for regression as well as classification problems. So if we use it for regression problems, then it is termed as Support Vector Regression. ▪ Support Vector Regression is a regression algorithm which works for continuous variables. Below are some keywords which are used in Support Vector Regression: ▪ Kernel: It is a function used to map a lower- dimensional data into higher dimensional data. ▪ Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line which helps to predict the continuous variables and cover most of the datapoints. 78
  • 75. Support Vector Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin for data points. ▪ Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and opposite class. In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum number of datapoints are covered in that margin. ▪ The main goal of SVR is to consider the maximum data points within the boundary lines and the hyperplane (best-fit line) must contain a maximum number of data points. 79
• 76. Decision Tree Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •Decision Tree is a supervised learning algorithm which can be used for solving both classification and regression problems. •It can solve problems for both categorical and numerical data. •Decision Tree regression builds a tree-like structure in which each internal node represents a "test" on an attribute, each branch represents the result of the test, and each leaf node represents the final decision or result. •A decision tree is constructed starting from the root node/parent node (the dataset), which splits into left and right child nodes (subsets of the dataset). These child nodes are further divided into their own children, thereby becoming the parent nodes of those nodes. 80
• 78. Decision Tree Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Image showing an example of Decision Tree regression; here, the model is trying to predict the choice of a person between a sports car and a luxury car. 82
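A minimal sketch of decision tree regression with scikit-learn on a hypothetical one-feature dataset:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: one feature -> continuous target
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0])

# max_depth limits how many times internal nodes may split,
# which controls overfitting
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict([[4.5]]))  # prediction = mean of the training targets in the reached leaf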
• 80. Random forest RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Random forest is one of the most powerful supervised learning algorithms, capable of performing regression as well as classification tasks. ▪ Random Forest regression is an ensemble learning method which combines multiple decision trees and predicts the final output based on the average of each tree's output. The combined decision trees are called base models: g(x) = f0(x) + f1(x) + f2(x) + ... 84
• 81. Random forest RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Random forest uses the Bagging or Bootstrap Aggregation technique of ensemble learning, in which the aggregated decision trees run in parallel and do not interact with each other. ▪ With the help of Random Forest regression, we can prevent overfitting in the model by creating random subsets of the dataset. 85
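A minimal sketch of random forest regression with scikit-learn on the same kind of hypothetical data; n_estimators bagged trees are trained and their predictions averaged:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0])

# n_estimators base trees are trained on bootstrap samples (bagging)
# and their outputs are averaged to give the ensemble prediction g(x)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict([[4.5]]))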
• 83. Ridge Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning •Ridge regression is one of the most robust versions of linear regression, in which a small amount of bias is introduced so that we get better long-term predictions. •The amount of bias added to the model is known as the Ridge Regression penalty. We compute this penalty term by multiplying lambda by the squared weight of each individual feature. •The cost function for ridge regression is then: Cost = Σ(yᵢ − ŷᵢ)² + λ Σ bⱼ² 87
• 84. Ridge Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ A general linear or polynomial regression will fail if there is high collinearity between the independent variables; to solve such problems, ridge regression can be used. ▪ Ridge regression is a regularization technique, used to reduce the complexity of the model. It is also called L2 regularization. ▪ It helps to solve problems where we have more parameters than samples. 88
• 86. Lasso Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Lasso regression is another regularization technique to reduce the complexity of the model. ▪ It is similar to Ridge Regression except that the penalty term contains the absolute values of the weights instead of their squares. ▪ Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge Regression can only shrink it close to 0. ▪ It is also called L1 regularization. The cost function for Lasso regression is: Cost = Σ(yᵢ − ŷᵢ)² + λ Σ |bⱼ| 90
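A minimal sketch contrasting the two penalties with scikit-learn, on hypothetical data where two true coefficients are zero; the alpha argument plays the role of lambda:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.1, 50)

# Larger alpha -> stronger penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks weights toward 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: can shrink weights exactly to 0

print(ridge.coef_)  # small but non-zero coefficients
print(lasso.coef_)  # some coefficients are exactly 0 (implicit feature selection)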
• 87. Model Performance RJEs: Remote job entry points https://byjus.com/maths/coefficient-of-determination/ https://aaweg-i.medium.com/what-precautions-we-need-to-keep-in-mind-when-using-coefficient-of-determination-98625e8bdb51 ▪ The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be assessed by the method below: R-squared method: ▪ It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%. ▪ A high R-squared value indicates a small difference between the predicted and actual values and hence represents a good model. ▪ It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression. 91
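A quick check of this metric, a minimal sketch using scikit-learn's r2_score on hypothetical predictions:

from sklearn.metrics import r2_score

y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.8, 5.1, 7.2, 8.7]

# R^2 = 1 - SS_res / SS_tot; closer to 1 means the predictions track the actual values
print(r2_score(y_actual, y_predicted))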
• 88. Gradient Descent RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning ▪ Gradient descent is used to minimize the MSE by calculating the gradient of the cost function. ▪ A regression model uses gradient descent to update the coefficients of the line by reducing the cost function. ▪ It starts from randomly selected coefficient values and then iteratively updates them to reach the minimum of the cost function. 92
  • 89. Gradient Descent RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning ▪ Gradient Descent is known as one of the most commonly used optimization algorithms to train machine learning models by means of minimizing errors between actual and expected results. Further, gradient descent is also used to train Neural Networks. ▪ Optimization algorithm refers to the task of minimizing/maximizing an objective function f(x) parameterized by x. ▪ Similarly, in machine learning, optimization is the task of minimizing the cost function parameterized by the model's parameters. The main objective of gradient descent is to minimize the convex function using iteration of parameter updates. 93
  • 90. What is Gradient Descent or Steepest Descent? RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning ▪ Gradient Descent is defined as one of the most commonly used iterative optimization algorithms of machine learning to train the machine learning and deep learning models. It helps in finding the local minimum of a function. ▪ If we move towards a negative gradient or away from the gradient of the function at the current point, it will give the local minimum of that function. ▪ Whenever we move towards a positive gradient or towards the gradient of the function at the current point, we will get the local maximum of that function. ▪ The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. 94
• 91. Gradient Descent RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning https://www.geeksforgeeks.org/gradient-descent-in-linear-regression/ ▪ Calculates the first-order derivative of the function to compute the gradient or slope of that function. ▪ Moves away from the direction of the gradient, i.e., steps from the current point in the direction of steepest descent by alpha times the gradient, where alpha is the learning rate. The learning rate is a tuning parameter in the optimization process which helps to decide the length of the steps. 95
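A minimal NumPy sketch of this update rule for simple linear regression (MSE cost) on hypothetical data:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

b0, b1 = 0.0, 0.0   # initial coefficients
alpha = 0.01        # learning rate

for _ in range(1000):
    y_pred = b0 + b1 * X
    error = y_pred - y
    # Partial derivatives of MSE = (1/n) * sum(error^2) w.r.t. b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * X).mean()
    # Step opposite to the gradient, scaled by the learning rate
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(b0, b1)  # approaches the least-squares line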
• 92. What is Cost-function? RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning ▪ The cost function is defined as the measurement of the difference or error between actual values and expected values. ▪ It helps to improve machine learning efficiency by providing feedback to the model so that it can minimize the error and find the local or global minimum. The algorithm continuously iterates along the direction of the negative gradient until the cost function approaches its minimum. ▪ At this point of convergence, the model stops learning further. 96
• 93. How does Gradient Descent work? RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning ▪ The starting point (shown in the figure) is just an arbitrary point used to evaluate the initial performance. At this starting point, we take the first derivative, using a tangent line to measure the steepness of the slope. This slope then informs the updates to the parameters (weights and bias). ▪ The slope is steep at the starting point, but as new parameters are generated, the steepness gradually reduces until the algorithm reaches the lowest point, which is called the point of convergence. ▪ The main objective of gradient descent is to minimize the cost function, i.e., the error between expected and actual values. 97
• 94. Gradient Descent RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning Direction & Learning Rate ▪ These two factors determine the partial derivative calculations of future iterations and drive the algorithm toward the point of convergence, a local or global minimum. Learning Rate: ▪ It is defined as the step size taken to reach the minimum or lowest point. It is typically a small value that is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum; a low learning rate gives small step sizes, which compromises overall efficiency but gives the advantage of more precision. 98
  • 95. Types of Gradient Descent RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/ ▪ Based on the error in various training models, the Gradient Descent learning algorithm can be divided into ▪ Batch gradient descent ▪ Mini-batch gradient descent ▪ Stochastic gradient descent 99
• 96. Batch Gradient Descent RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/ ▪ Batch Gradient Descent: ▪ Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after evaluating all training examples in the batch. One such pass is known as a training epoch. ▪ Advantages of Batch gradient descent: ▪ It produces less noise in comparison to other types of gradient descent. ▪ It produces stable gradient descent convergence. ▪ It is computationally efficient, as all resources are used to process all training samples together. 100
• 97. Stochastic gradient descent RJEs: Remote job entry points ▪ Stochastic gradient descent (SGD) is a type of gradient descent that runs one training example per iteration. In other words, it processes each example within the dataset individually and updates the parameters one training example at a time. ▪ As it requires only one training example at a time, it is easier to store in the allocated memory. ▪ However, it loses some computational efficiency in comparison to batch gradient descent because of the frequent parameter updates. ▪ Further, due to the frequent updates, the gradient is noisy. However, this noise can sometimes be helpful in escaping local minima and finding the global minimum. https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/ 101
• 98. Stochastic gradient descent RJEs: Remote job entry points ▪ Advantages of Stochastic gradient descent: ▪ In Stochastic gradient descent (SGD), learning happens on every example, which gives it a few advantages over other types of gradient descent. ▪ It is easier to fit in the allocated memory. ▪ It is relatively fast to compute compared to batch gradient descent. https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/ 102
• 99. MiniBatch Gradient Descent: RJEs: Remote job entry points ▪ Mini-batch gradient descent is the combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs the updates on those batches separately. ▪ Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. ▪ Hence, we achieve a special type of gradient descent with higher computational efficiency and a less noisy gradient. ▪ Advantages of Mini-batch gradient descent: ▪ It is easier to fit in allocated memory. ▪ It is computationally efficient. ▪ It produces stable gradient descent convergence. https://www.javatpoint.com/gradient-descent-in-machine-learning, https://www.analyticsvidhya.com/blog/2022/07/gradient-descent-and-its-types/ 103
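A minimal NumPy sketch contrasting the three update granularities on hypothetical data (the batch and SGD variants are shown as comments; the mini-batch loop runs):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 0.1, 100)

def gradient(w, Xb, yb):
    # Gradient of MSE for a linear model y = w * x on the batch (Xb, yb)
    return 2 * ((w * Xb.ravel() - yb) * Xb.ravel()).mean()

w, alpha = 0.0, 0.05
for epoch in range(20):
    # Batch GD: one update per epoch using all samples
    # w -= alpha * gradient(w, X, y)

    # SGD: one update per sample
    # for i in range(len(X)):
    #     w -= alpha * gradient(w, X[i:i+1], y[i:i+1])

    # Mini-batch GD: one update per small batch (here, size 10)
    for start in range(0, len(X), 10):
        Xb, yb = X[start:start+10], y[start:start+10]
        w -= alpha * gradient(w, Xb, yb)

print(w)  # close to the true slope 3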
• 100. Challenges with the Gradient Descent RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning Local Minima and Saddle Point: ▪ For convex problems, gradient descent can find the global minimum easily, while for non-convex problems it is sometimes difficult to find the global minimum, where the machine learning model achieves the best results. ▪ Whenever the slope of the cost function is at or close to zero, the model stops learning further. Apart from the global minimum, two other scenarios can produce this slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, where the slope of the cost function increases on both sides of the current point. ▪ In contrast, at a saddle point the negative gradient occurs only on one side of the point: the point is a local maximum along one direction and a local minimum along another. A saddle point takes its name from a horse's saddle. ▪ A local minimum is so named because the value of the loss function is minimum at that point within a local region. In contrast, a global minimum is so named because the value of the loss function is minimum there globally, across the entire domain of the loss function. 104
• 101. Vanishing and Exploding Gradient RJEs: Remote job entry points https://www.javatpoint.com/gradient-descent-in-machine-learning ▪ Vanishing Gradients: ▪ A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient becomes progressively smaller, causing the earlier layers of the network to learn more slowly than the later layers. When this happens, the weight updates become insignificant and the parameters effectively stop changing. ▪ Exploding Gradient: ▪ An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient is too large. In this scenario, the model weights grow so large that they end up represented as NaN. This problem can be mitigated by reducing the complexity of the model, for example with dimensionality reduction techniques. 105
• 102. Classification Algorithm in Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning What is the Classification Algorithm? ▪ The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories. ▪ y = f(x), where y = categorical output ▪ The best example of an ML classification algorithm is an email spam detector. ▪ The main goal of the classification algorithm is to identify the category of a given dataset, and these algorithms are mainly used to predict the output for categorical data. ▪ Classification algorithms can be better understood using the diagram. In the diagram, there are two classes, Class A and Class B. Within each class, the samples have features similar to each other and dissimilar to those of the other class. 106
• 103. Classification Algorithm in Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ The algorithm which implements classification on a dataset is known as a classifier. ▪ Binary Classifier: If the classification problem has only two possible outcomes, it is called a Binary Classifier. ▪ Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc. ▪ Multi-class Classifier: If a classification problem has more than two outcomes, it is called a Multi-class Classifier. ▪ Examples: classification of types of crops, classification of types of music. 107
• 104. Classification RJEs: Remote job entry points ▪ A Supervised Learning technique that is used to identify the category of new observations on the basis of training data[1] ▪ In classification, the output is categorical, unlike in regression where it is based on predicting 'values' ▪ Types of classification[2]- ▪ Binary classification: When we have to categorize the given data into 2 distinct classes. Example – on the basis of the given health conditions of a person, determine whether the person has a certain disease or not ▪ Multiclass classification: The number of classes is more than 2. Example – on the basis of data about different species of flowers, determine which species our observation belongs to Ref: [1] https://www.javatpoint.com/classification-algorithm-in-machine-learning [2] https://www.geeksforgeeks.org/getting-started-with-classification/?ref=lbp 108
  • 105. Classification and its types RJEs: Remote job entry points ▪ General Block diagram of classification task: Ref: https://www.geeksforgeeks.org/getting-started-with-classification/?ref=lbp ▪ There are various types of classifiers. Some of them are: ▪ Linear Classifiers: Logistic Regression ▪ Tree-Based Classifiers: Decision Tree Classifier ▪ Support Vector Machines ▪ Artificial Neural Networks ▪ Bayesian Regression ▪ Gaussian Naive Bayes Classifiers ▪ Stochastic Gradient Descent (SGD) Classifier ▪ Ensemble Methods: Random Forests, AdaBoost, Bagging Classifier, Voting Classifier, etc. • X: Pre-classified data • y: label/observations for X • y’: predicted labels for X 109
• 106. Learners in Classification Problems RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ In classification problems, there are two types of learners: ▪ Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions. Example: K-NN algorithm, case-based reasoning ▪ Eager Learners: Eager learners develop a classification model based on a training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction. ▪ Example: Decision Trees, Naïve Bayes, ANN. 110
• 107. Types of ML Classification Algorithms RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Classification algorithms can be further divided into mainly two categories: ▪ Linear Models ▪ Logistic Regression ▪ Support Vector Machines ▪ Non-linear Models ▪ K-Nearest Neighbours ▪ Kernel SVM ▪ Naïve Bayes ▪ Decision Tree Classification ▪ Random Forest Classification 111
• 108. Evaluating a Classification model RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Log Loss or Cross-Entropy Loss: ▪ It is used for evaluating the performance of a classifier whose output is a probability value between 0 and 1. ▪ For a good binary classification model, the value of log loss should be near 0. ▪ The value of log loss increases if the predicted value deviates from the actual value. ▪ A lower log loss represents higher accuracy of the model. ▪ For binary classification, cross-entropy can be calculated as: Log Loss = −(1/N) Σᵢ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)] ▪ Here, pᵢ is the predicted probability of class 1, and (1 − pᵢ) is the probability of class 0. ▪ When the observation belongs to class 1, the first part of the formula becomes active and the second part vanishes, and vice versa. 112
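A minimal sketch computing binary cross-entropy both by hand and with scikit-learn's log_loss, on hypothetical probabilities:

import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.1, 0.8, 0.6, 0.3]   # predicted P(class 1)

# Manual binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))
y, p = np.array(y_true), np.array(p_pred)
manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(manual, log_loss(y_true, p_pred))  # both near 0 for a good model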
• 109. Confusion Matrix RJEs: Remote job entry points https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5 ▪ The confusion matrix provides a matrix/table as output and describes the performance of the model. ▪ It is also known as the error matrix. ▪ The matrix summarizes the prediction results, giving the total numbers of correct and incorrect predictions broken down by class. 113
• 110. Accuracy RJEs: Remote job entry points https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5 Accuracy simply measures how often the classifier makes the correct prediction. It is the ratio between the number of correct predictions and the total number of predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN) 114
• 111. Precision RJEs: Remote job entry points https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5 ▪ It is a measure of the correctness achieved in positive predictions. In simple words, it tells us how many of all the observations predicted as positive are actually positive: Precision = TP / (TP + FP) 115
• 112. Recall RJEs: Remote job entry points https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5 ▪ It is a measure of the actual positive observations which are predicted correctly, i.e., how many observations of the positive class are actually predicted as positive: Recall = TP / (TP + FN) ▪ It is also known as Sensitivity. ▪ Recall is a valid choice of evaluation metric when we want to capture as many positives as possible. 116
• 113. F-measure / F1-Score RJEs: Remote job entry points https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5 ▪ The F1 score is a number between 0 and 1 and is the harmonic mean of precision and recall: F1 = 2 · (Precision · Recall) / (Precision + Recall) ▪ We use the harmonic mean because, unlike a simple average, it is not dominated by extremely large values: a high F1 requires both precision and recall to be high. 117
• 114. Sensitivity & Specificity RJEs: Remote job entry points https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5 ▪ Sensitivity (true positive rate) is the proportion of actual positives that are correctly identified: Sensitivity = TP / (TP + FN) ▪ Specificity (true negative rate) is the proportion of actual negatives that are correctly identified: Specificity = TN / (TN + FP) 118
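A minimal sketch deriving all of these metrics from a confusion matrix with scikit-learn, on hypothetical labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # sensitivity / true positive rate
specificity = tn / (tn + fp)          # true negative rate
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)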
• 115. Difference between Regression and Classification RJEs: Remote job entry points https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5 Regression Algorithm vs. Classification Algorithm: ▪ In Regression, the output variable must be of continuous nature or real value; in Classification, the output variable must be a discrete value. ▪ The task of the regression algorithm is to map the input value (x) to a continuous output variable (y); the task of the classification algorithm is to map the input value (x) to a discrete output variable (y). ▪ Regression algorithms are used with continuous data; classification algorithms are used with discrete data. ▪ In Regression, we try to find the best-fit line, which can predict the output more accurately; in Classification, we try to find the decision boundary, which can divide the dataset into different classes. ▪ Regression algorithms can be used to solve problems such as weather prediction and house price prediction; classification algorithms can be used to solve problems such as identifying spam emails, speech recognition, and identifying cancer cells. ▪ Regression algorithms can be further divided into Linear and Non-linear Regression; classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers. 119
• 116. RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Linear Regression vs. Logistic Regression: ▪ Linear regression is used to predict a continuous dependent variable using a given set of independent variables; logistic regression is used to predict a categorical dependent variable using a given set of independent variables. ▪ Linear regression is used for solving regression problems; logistic regression is used for solving classification problems. ▪ In linear regression, we predict the values of continuous variables; in logistic regression, we predict the values of categorical variables. ▪ In linear regression, we find the best-fit line, by which we can easily predict the output; in logistic regression, we find the S-curve by which we can classify the samples. ▪ In linear regression, the least squares method is used to estimate the coefficients; in logistic regression, the maximum likelihood estimation method is used. ▪ The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc. ▪ In linear regression, the relationship between the dependent variable and the independent variables must be linear; in logistic regression, a linear relationship between the dependent and independent variables is not required. ▪ In linear regression, there may be collinearity between the independent variables; in logistic regression, there should not be collinearity between the independent variables. 120
• 117. Linear Regression vs. Logistic Regression RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems. 121
• 118. Clustering in Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group." ▪ It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and divides the data as per the presence and absence of those patterns. ▪ It is an unsupervised learning method; hence no supervision is provided to the algorithm, and it deals with an unlabelled dataset. 122
• 119. Clustering in Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ After applying this clustering technique, each cluster or group is provided with a cluster ID. The ML system can use this ID to simplify the processing of large and complex datasets. ▪ The clustering technique can be widely used in various tasks: ▪ Market Segmentation ▪ Statistical data analysis ▪ Social network analysis ▪ Image segmentation ▪ Anomaly detection, etc. 123
  • 120. Types of Clustering Methods RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ The clustering methods are broadly divided into Hard clustering (datapoint belongs to only one group) and Soft Clustering (data points can belong to another group also). ▪ Partitioning Clustering ▪ Density-Based Clustering ▪ Distribution Model-Based Clustering ▪ Hierarchical Clustering ▪ Fuzzy Clustering 124
• 121. Hierarchical Clustering in Machine Learning RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA. ▪ In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram. ▪ Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work; for one, there is no requirement to predetermine the number of clusters as there is in the K-means algorithm. ▪ The hierarchical clustering technique has two approaches: ▪ Agglomerative: a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and merges them until one cluster is left. ▪ Divisive: the reverse of the agglomerative algorithm, i.e., a top-down approach. 125
• 122. Hierarchical Clustering RJEs: Remote job entry points ▪ The clusters formed in this method form a tree-type structure called a dendrogram, based on the hierarchy[1] ▪ New clusters are formed using the previously formed ones ▪ It is divided into two categories: ▪ Agglomerative clustering: a bottom-up approach ▪ Divisive clustering: a top-down approach ▪ Examples are CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc. ▪ Agglomerative-based dendrogram[2]: Ref: [1] https://www.geeksforgeeks.org/clustering-in-machine-learning/ [2] https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019 126
• 123. Why hierarchical clustering? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ As we already have other clustering algorithms such as K-means clustering, why do we need hierarchical clustering? ▪ As we have seen, K-means clustering comes with some challenges: it needs a predetermined number of clusters, and it always tries to create clusters of the same size. ▪ To solve these two challenges, we can opt for the hierarchical clustering algorithm. ▪ In this algorithm, we don't need to know the number of clusters in advance. 127
• 124. Agglomerative Hierarchical clustering RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ The agglomerative hierarchical clustering algorithm is a popular example of HCA. ▪ To group the datasets into clusters, it follows the bottom-up approach. This means the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pairs of clusters. ▪ It does this until all the clusters are merged into a single cluster that contains all the data. ▪ This hierarchy of clusters is represented in the form of the dendrogram. 128
• 125. How Does Agglomerative Hierarchical Clustering Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Step-1: Create each data point as a single cluster. Let's say there are N data points, so the number of clusters will also be N. ▪ Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will then be N-1 clusters. ▪ Step-3: Again, take the two closest clusters and merge them to form one cluster. There will be N-2 clusters. 129
• 126. How Does Agglomerative Hierarchical Clustering Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Step-4: Repeat Step 3 until only one cluster is left, giving the following clusters. ▪ Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram and cut it to divide the clusters as the problem requires. 130
• 127. Measure for the distance between two clusters RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ As we have seen, the distance between two clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering. ▪ These measures are called linkage methods. ▪ Single Linkage: the shortest distance between the closest points of the two clusters. 131
  • 128. Measure for the distance between two clusters RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Complete Linkage: It is the farthest distance between the two points of two different clusters. It is one of the popular linkage methods as it forms tighter clusters than single-linkage. ▪ Average Linkage: It is the linkage method in which the distance between each pair of datasets is added up and then divided by the total number of datasets to calculate the average distance between two clusters. ▪ Centroid Linkage: It is the linkage method in which the distance between the centroid of the clusters is calculated. 132
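A minimal sketch comparing linkage choices with scikit-learn's AgglomerativeClustering on a small hypothetical dataset:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# linkage can be "single", "complete", "average", or "ward";
# it decides how the distance between two clusters is measured
for linkage in ["single", "complete", "average"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(linkage, labels)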
• 129. Density-Based Clustering RJEs: Remote job entry points ▪ This method connects highly dense areas into clusters [1] ▪ These methods have good accuracy and the ability to merge two clusters [2] ▪ This type of clustering algorithm plays a crucial role in evaluating and finding non-linear cluster shapes based on density [3] ▪ The most popular density-based algorithm is DBSCAN, which allows spatial clustering of data with noise ▪ It makes use of two concepts – data reachability and data connectivity Ref: [1] https://www.javatpoint.com/clustering-in-machine-learning, [4] https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html [2] https://www.geeksforgeeks.org/clustering-in-machine-learning/, [3] https://www.geeksforgeeks.org/clustering-in-machine-learning/ ▪ Density-based spatial clustering of applications with noise (DBSCAN): ▪ Based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density [4] ▪ No need to explicitly define the number of clusters (K) as in K-means ▪ The DBSCAN algorithm uses two parameters: 1) minPts: the minimum number of points (a threshold) clustered together for a region to be considered dense, 2) eps (ε): a distance measure that is used to locate the points in the neighborhood of any point ▪ There are three types of points after the DBSCAN clustering is complete: 1) core points, 2) border points, 3) noise points 133
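A minimal DBSCAN sketch with scikit-learn on hypothetical points containing one outlier:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [4.0, 50.0]])  # an outlier

# eps = neighborhood radius, min_samples = the minPts density threshold
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # cluster ids; -1 marks noise points (the outlier)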
• 130. Distribution Model-Based Clustering RJEs: Remote job entry points ▪ Here, the data is divided based on the probability that a data point belongs to a particular distribution [1] ▪ The grouping is done by assuming some distributions, most commonly the Gaussian distribution ▪ The observed data arises from a distribution consisting of a mixture of two or more cluster components [2] ▪ Furthermore, each component cluster has a density function with an associated probability or weight in this mixture ▪ An example of this type is the Expectation-Maximization (EM) clustering algorithm, which uses Gaussian Mixture Models (GMM) [1]. Two different examples of EM clustering are represented below: Ref: [1] https://www.javatpoint.com/clustering-in-machine-learning [2] https://data-flair.training/blogs/clustering-in-machine-learning/ 134
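A minimal sketch of EM-based soft clustering with scikit-learn's GaussianMixture on hypothetical two-component data:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data drawn from two Gaussian components
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# GaussianMixture fits the component means, covariances, and weights with EM
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)                # estimated component centers
print(gmm.predict_proba(X[:3]))  # soft membership probabilities per point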
• 131. Partition Clustering RJEs: Remote job entry points ▪ It is a type of clustering that divides the data into non-hierarchical groups [1] ▪ It is also known as the centroid-based method ▪ These methods partition the objects into k clusters, and each partition forms one cluster[2] ▪ This method optimizes an objective criterion, such as a similarity function ▪ The most common example of partitioning clustering is the K-Means Clustering algorithm [1] Ref: [1] https://www.javatpoint.com/clustering-in-machine-learning [2] https://www.geeksforgeeks.org/clustering-in-machine-learning/ ▪ K-Means Clustering: ▪ It groups the unlabeled dataset into K clusters ▪ The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding clusters ▪ The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters ▪ It determines the best values for the K center points or centroids by an iterative process 135
• 132. Fuzzy Clustering RJEs: Remote job entry points ▪ Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster [1] ▪ Each data point has a set of membership coefficients, which reflect its degree of membership in each cluster ▪ The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm Ref: [1] https://www.javatpoint.com/clustering-in-machine-learning [2] https://2-bitbio.com/post/clustering-rnaseq-data-using-fuzzy-c-means-clustering/ ▪ In the adjacent image, K-means clustering produces output based on a minimum-distance calculation and is an example of hard clustering[2] ▪ Fuzzy C-means performs soft clustering by giving a membership coefficient to each data point ▪ Fuzzy clustering is used to solve multiclass or ambiguous clustering problems. 136
  • 133. Applications of Clustering RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ In Identification of Cancer Cells: The clustering algorithms are widely used for the identification of cancerous cells. It divides the cancerous and non-cancerous data sets into different groups. ▪ In Search Engines: Search engines also work on the clustering technique. The search result appears based on the closest object to the search query. It does it by grouping similar data objects in one group that is far from the other dissimilar objects. The accurate result of a query depends on the quality of the clustering algorithm used. ▪ Customer Segmentation: It is used in market research to segment the customers based on their choice and preferences. ▪ In Biology: It is used in the biology stream to classify different species of plants and animals using the image recognition technique. 137
• 134. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ K-Means Clustering is an Unsupervised Learning algorithm which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=3, there will be three clusters; for K=4, there will be four clusters; and so on. ▪ It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties. ▪ It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabelled dataset on its own, without the need for any training. 138
• 135. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding clusters. ▪ The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm. ▪ The k-means clustering algorithm mainly performs two tasks: ▪ Determines the best values for the K center points or centroids by an iterative process. ▪ Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster. ▪ Hence each cluster has data points with some commonalities, and it is away from the other clusters. 139
• 136. How does the K-Means Algorithm Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Step-1: Select the number K to decide the number of clusters. ▪ Step-2: Select K random points or centroids. (They need not come from the input dataset.) ▪ Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters. ▪ Step-4: Calculate the variance and place a new centroid for each cluster. ▪ Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster. ▪ Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH. ▪ Step-7: The model is ready. A minimal scikit-learn version of this loop is sketched below. 140
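A minimal sketch, assuming scikit-learn's KMeans (which runs the assignment/update loop of the steps above internally):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [5, 8],
              [8, 8], [1, 0.6], [9, 11]])

# n_clusters = K; the algorithm alternates the assignment and centroid-update
# steps until the assignments stop changing
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # final centroids
print(kmeans.labels_)           # cluster id of each point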
• 137. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Let's take the number k of clusters, i.e., K=2, to identify the dataset and put the points into different clusters. This means we will try to group these data points into two different clusters. We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So here we select the below two points as k points, which are not part of our dataset. ▪ Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this by calculating the distance between the points and the centroids; conveniently, we can draw a median line between both centroids. 141
• 138. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Points on the left side of the line are nearer to the K1 (blue) centroid, and points to the right of the line are closer to the yellow centroid. ▪ Let's color them blue and yellow for clear visualization. 142
• 139. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ As we need to find the closest clusters, we repeat the process by choosing new centroids. To choose the new centroids, we compute the center of gravity of the points in each cluster and place the new centroids there, as shown below. ▪ Next, we reassign each data point to the new centroids. For this, we repeat the same process of finding a median line. 143
• 140. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ We can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points are reassigned to the new centroids. ▪ As reassignment has taken place, we again go to Step-4, which is finding new centroids or K-points. ▪ We repeat the process of finding the center of gravity of each cluster, so the new centroids will be as shown in the image. 144
• 141. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ With the new centroids, we again draw the median line and reassign the data points. ▪ We can see in the image that no data points fall on the wrong side of the line, which means the assignments have stabilized and our model has converged. 145
• 142. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ As our model is ready, we can now remove the assumed centroids, and the two final clusters are as shown in the image below. 146
• 143. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ How to choose the value of "K number of clusters" in K-means Clustering? ▪ The performance of the K-means clustering algorithm depends on the quality of the clusters it forms. But choosing the optimal number of clusters is a big task. ▪ There are several ways to find the optimal number of clusters; here we discuss the most appropriate method to find the number of clusters, i.e., the value of K: ▪ The Elbow Method 147
• 144. Elbow Method RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ The Elbow method is one of the most popular ways to find the optimal number of clusters. ▪ This method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. The formula for WCSS (for 3 clusters) is: WCSS = Σ_{Pi in Cluster1} distance(Pi, C1)² + Σ_{Pi in Cluster2} distance(Pi, C2)² + Σ_{Pi in Cluster3} distance(Pi, C3)² ▪ Each term is the sum of the squared distances between each data point and the centroid of its cluster. 148
• 145. K-Means Clustering Algorithm RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ To measure the distance between data points and a centroid, we can use any metric such as Euclidean distance or Manhattan distance. ▪ To find the optimal number of clusters, the elbow method follows the steps below: ▪ It executes K-means clustering on a given dataset for different K values (typically ranging from 1 to 10). ▪ For each value of K, it calculates the WCSS value. ▪ It plots a curve between the calculated WCSS values and the number of clusters K. ▪ The sharp point of the bend, where the plot looks like an arm, is considered the best value of K. ▪ Since the graph shows a sharp bend that looks like an elbow, the approach is known as the elbow method. 149
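A minimal elbow-method sketch with scikit-learn on hypothetical three-cluster data; KMeans's inertia_ attribute is exactly the WCSS:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

# inertia_ = sum of squared distances of points to their assigned centroid
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

print(wcss)  # plotting k vs. WCSS shows the "elbow" (here near k=3), the chosen K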
  • 146. Decision Tree Classification Algorithm RJEs: Remote job entry points https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm ▪ Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome. ▪ In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches. ▪ The decisions or the test are performed on the basis of features of the given dataset. ▪ It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions. 150
• 147. Decision Tree Classification Algorithm RJEs: Remote job entry points https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm ▪ It is called a decision tree because, similar to a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure. ▪ In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm. ▪ A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees. 151
  • 148. Why use Decision Trees? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand. ▪ The logic behind the decision tree can be easily understood because it shows a tree-like structure. ▪ Decision Tree Terminologies ▪ Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets. ▪ Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting a leaf node. ▪ Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions. ▪ Branch/Sub Tree: A tree formed by splitting the tree. ▪ Pruning: Pruning is the process of removing the unwanted branches from the tree. ▪ Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes. 152
• 149. How does the Decision Tree algorithm Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the record's (real dataset's) attribute and, based on the comparison, follows the branch and jumps to the next node. ▪ For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm: ▪ Step-1: Begin the tree with the root node, say S, which contains the complete dataset. ▪ Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM). ▪ Step-3: Divide S into subsets that contain possible values for the best attribute. ▪ Step-4: Generate the decision tree node which contains the best attribute. ▪ Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot classify the nodes further; these final nodes are the leaf nodes. 153
• 150. How does the Decision Tree algorithm Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. ▪ To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). ▪ The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. ▪ The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). 154
• 151. How does the Decision Tree algorithm Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Attribute Selection Measures ▪ While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems, there is a technique called the Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM: ▪ Information Gain ▪ Gini Index 155
  • 152. How does the Decision Tree algorithm Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Information Gain: ▪ Information gain is the measurement of changes in entropy after the segmentation of a dataset based on an attribute. ▪ It calculates how much information a feature provides us about a class. ▪ According to the value of information gain, we split the node and build the decision tree. ▪ A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the highest information gain is split first. It can be calculated using the below formula: Information Gain= Entropy (S) - [(Weighted Avg) *Entropy(each feature)] 156
• 153. How does the Decision Tree algorithm Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Entropy: Entropy is a metric to measure the impurity in a given attribute; it specifies the randomness in the data. Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no) Where, •S = the set of samples •P(yes) = probability of yes •P(no) = probability of no 157
• 154. How does the Decision Tree algorithm Work? RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Gini Index: •The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. •An attribute with a low Gini index should be preferred over one with a high Gini index. •It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits. •The Gini index can be calculated using the formula: Gini Index = 1 − Σⱼ Pⱼ², where Pⱼ is the proportion of samples belonging to class j. 158
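A minimal NumPy sketch computing entropy, information gain, and the Gini index on a hypothetical yes/no split:

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum(p^2) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# Hypothetical parent node and a candidate split into two children
parent = np.array(["yes"] * 9 + ["no"] * 5)
left   = np.array(["yes"] * 6 + ["no"] * 1)
right  = np.array(["yes"] * 3 + ["no"] * 4)

# Information Gain = Entropy(S) - weighted average entropy of the children
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("information gain:", entropy(parent) - weighted)
print("gini of parent:", gini(parent))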
  • 155. Pruning: Getting an Optimal Decision tree RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning, https://www.cs.cmu.edu/~bhiksha/courses/10-601/decisiontrees/ ▪ Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree. ▪ A too-large tree increases the risk of overfitting, and a small tree may not capture all the important features of the dataset. ▪ Therefore, a technique that decreases the size of the learning tree without reducing accuracy is known as Pruning. ▪ There are mainly two types of tree pruning technology used: ▪ Cost Complexity Pruning ▪ Reduced Error Pruning 159
• 158. Advantages/Disadvantages of the Decision Tree RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Advantages of the Decision Tree ▪ It is simple to understand, as it follows the same process a human follows while making a decision in real life. ▪ It can be very useful for solving decision-related problems. ▪ It helps to think about all the possible outcomes for a problem. ▪ There is less requirement for data cleaning compared to other algorithms. ▪ Disadvantages of the Decision Tree ▪ The decision tree contains lots of layers, which makes it complex. ▪ It may have an overfitting issue, which can be resolved using the Random Forest algorithm. ▪ For more class labels, the computational complexity of the decision tree may increase. 162
  • 159. Python Implementation of Decision Tree RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Data Pre-processing step ▪ Fitting a Decision-Tree algorithm to the Training set ▪ Predicting the test result ▪ Test accuracy of the result (Creation of Confusion matrix) ▪ Visualizing the test set result 163
• 160. Data Pre-Processing Step RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
164
• 162. Fitting a Decision-Tree algorithm to the Training set RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Now we will fit the model to the training set. For this, we will import the DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:
#Fitting Decision Tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
▪ In the above code, we have created a classifier object, to which we have passed two main parameters: ▪ criterion='entropy': used to measure the quality of a split, calculated by the information gain given by entropy. ▪ random_state=0: for generating reproducible random states. 166
• 164. Fitting a Decision-Tree algorithm to the Training set RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning
Out[8]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
        max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
        min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
        min_weight_fraction_leaf=0.0, presort=False, random_state=0, splitter='best')
168
• 165. Predicting the test result RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning Now we will predict the test set result by creating a new prediction vector y_pred. Below is the code for it:
# Predicting the test set result
y_pred = classifier.predict(x_test)
In the output image below, the predicted output and the real test output are shown. Some values in the prediction vector differ from the real vector values; these are prediction errors. 169
• 167. Test accuracy of the result (Creation of Confusion matrix) RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ In the above output, we saw that there were some incorrect predictions. To count the correct and incorrect predictions, we use the confusion matrix. Below is the code for it:
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
▪ In the output image, the confusion matrix shows 6+3 = 9 incorrect predictions and 62+29 = 91 correct predictions. 171
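▪ As a small follow-up sketch, the accuracy can be read directly off the confusion matrix computed above (cm is a 2x2 array for this binary problem), or obtained with scikit-learn's helper:
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()  # correct predictions / total
print(accuracy)  # 91/100 = 0.91 for the counts reported above

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))  # same value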
• 168. Visualizing the training set result: RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Here we will visualize the training set result by plotting the decision regions of the decision tree classifier, which predicts Yes or No for the users who have either purchased or not purchased the SUV. Below is the code for it: ▪ The output is quite different from the previous classification models: it has both vertical and horizontal lines splitting the dataset according to the age and estimated salary variables. ▪ As we can see, the tree tries to capture every data point, which is a sign of overfitting. 172
• 169. Visualizing the training set result: RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
173
• 170. Visualizing the test set result: RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ Visualization of the test set result is similar to the visualization of the training set, except that the training set is replaced with the test set.
# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('purple', 'green'))(i), label=j)
mtp.title('Decision Tree Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
174
• 171. Visualizing the test set result: RJEs: Remote job entry points Ref: https://www.javatpoint.com/supervised-machine-learning ▪ As we can see in the above image, there are some green data points within the purple region and vice versa. ▪ These are the incorrect predictions that we discussed in the confusion matrix. 175
  • 172. K-Nearest Neighbor (KNN) Algorithm 176
• 173. K-Nearest Neighbor (KNN) Algorithm RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique. ▪ The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories. ▪ The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm. ▪ K-NN can be used for Regression as well as Classification, but it is mostly used for Classification problems. ▪ K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data. ▪ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at classification time. ▪ At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data. ▪ Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data most similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category. 177
• 175. Why do we need a K-NN Algorithm? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. ▪ With the help of K-NN, we can easily identify the category or class of a new data point. 179
• 176. How does K-NN work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Step-1: Select the number K of neighbors. ▪ Step-2: Calculate the Euclidean distance from the new point to the training points. ▪ Step-3: Take the K nearest neighbors as per the calculated Euclidean distances. ▪ Step-4: Among these K neighbors, count the number of data points in each category. ▪ Step-5: Assign the new data point to the category for which the number of neighbors is maximum. A from-scratch sketch of these steps is shown below. 180
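▪ A minimal NumPy sketch of the five steps above (illustrative only; knn_predict is a hypothetical helper, not a library API):
import numpy as np
from collections import Counter

def knn_predict(x_new, X_train, y_train, k=5):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count the categories among the neighbours and return the majority
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]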
• 177. How does K-NN work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Suppose we have a new data point and we need to put it in the required category. Consider the image: ▪ Firstly, we choose the number of neighbors; here we choose k = 5. ▪ Next, we calculate the Euclidean distance between the data points. ▪ The Euclidean distance is the distance between two points; for points (x1, y1) and (x2, y2) it is d = sqrt((x2 - x1)^2 + (y2 - y1)^2). 181
• 178. How does K-NN work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ By calculating the Euclidean distances, we obtain the nearest neighbors: three nearest neighbors in category A and two in category B. Consider the image below: ▪ Since the majority of the 5 nearest neighbors are from category A, the new data point must belong to category A. 182
• 179. How to select the value of K in the K-NN Algorithm? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Below are some points to remember while selecting the value of K in the K-NN algorithm: ▪ There is no particular way to determine the best value for "K", so we need to try several values to find the best among them; a cross-validation sketch follows below. ▪ A commonly used default value for K is 5. ▪ A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of outliers. 183
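▪ Since there is no fixed rule for K, one common approach (a sketch, assuming the scaled x_train and y_train prepared in the pre-processing step shown below) is to try several values with cross-validation and keep the best:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, x_train, y_train, cv=5).mean()  # 5-fold CV accuracy
best_k = max(scores, key=scores.get)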
• 180. Advantages / Disadvantages of KNN Algorithm RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Advantages of the KNN Algorithm ▪ It is simple to implement. ▪ It is robust to noisy training data. ▪ It can be more effective if the training data is large. ▪ Disadvantages of the KNN Algorithm ▪ It always needs a value of K to be determined, which can be complex at times. ▪ The computation cost is high because the distance to all training samples must be calculated. 184
• 181. Python implementation of the KNN algorithm RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Problem statement: A car manufacturer has produced a new SUV. ▪ The company wants to show ads to the users who are interested in buying that SUV. ▪ For this problem, we have a dataset containing information about users of a social network. ▪ The dataset contains a lot of information, but we will use Estimated Salary and Age as the independent variables and Purchased as the dependent variable. 185
• 182. Steps to implement the K-NN algorithm RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Data Pre-processing step ▪ Fitting the K-NN algorithm to the Training set ▪ Predicting the test result ▪ Test accuracy of the result (Creation of Confusion matrix) ▪ Visualizing the test set result 186
• 183. Data Pre-Processing Step RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# importing datasets
data_set = pd.read_csv('user_data.csv')
# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
187
• 184. Data Pre-Processing Step RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ By executing the above code, our dataset is imported into our program and pre-processed. After feature scaling, the test dataset will look like the image below. ▪ From the output image, we can see that our data has been successfully scaled. 188
• 185. Fitting K-NN classifier to the Training data RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Now we will fit the K-NN classifier to the training data. ▪ To do this, we import the KNeighborsClassifier class from the sklearn.neighbors library. ▪ After importing the class, we create the classifier object. ▪ The parameters of this class are: n_neighbors, the required number of neighbors for the algorithm (usually 5); ▪ metric='minkowski', the default parameter, which decides the distance between the points; ▪ p=2, which makes the Minkowski metric equivalent to the standard Euclidean metric. ▪ Then we fit the classifier to the training data.
# Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
189
• 186. Python implementation of the KNN algorithm RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning Output: By executing the above code, we will get the output as:
Out[10]: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, weights='uniform')
▪ Predicting the Test Result: To predict the test set result:
# Predicting the test set result
y_pred = classifier.predict(x_test)
190
• 187. Creating the Confusion Matrix RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Now we will create the confusion matrix for our K-NN model to see the accuracy of the classifier.
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
▪ In the above code, we imported the confusion_matrix function and stored its result in the variable cm. ▪ Output: By executing the above code, we get the matrix shown in the image: 191
• 188. Confusion Matrix RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ In the image, we can see that there are 64+29 = 93 correct predictions and 3+4 = 7 incorrect predictions. 192
• 189. Visualizing the Training set result RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
193
• 190. Visualizing the Training set result RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ The above graph shows the output for the training data set. ▪ As we can see in the graph, the predicted output is quite good, as most of the red points are in the red region and most of the green points are in the green region. ▪ However, there are a few green points in the red region and a few red points in the green region; these are the incorrect observations that we saw in the confusion matrix (7 incorrect outputs). 194
• 192. Support Vector Machine RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Support Vector Machine (SVM) is one of the most popular Supervised Learning algorithms, used for Classification as well as Regression problems. ▪ Primarily, however, it is used for Classification problems in Machine Learning. ▪ The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put new data points in the correct category in the future. 196
• 193. Support Vector Machine RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ This best decision boundary is called a hyperplane. ▪ SVM chooses the extreme points/vectors that help in creating the hyperplane. ▪ These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. ▪ Consider the diagram in which two different categories are classified using a decision boundary or hyperplane: 197
• 194. Support Vector Machine RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Example: Suppose we see a strange cat that also has some features of dogs. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. ▪ We first train our model with many images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. ▪ The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it classifies the creature as a cat. Consider the diagram below: 198
• 195. Types of SVM RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning SVM can be of two types: ▪ Linear SVM: Linear SVM is used for linearly separable data; if a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier. ▪ Non-linear SVM: Non-linear SVM is used for non-linearly separable data; if a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier. 199
• 196. Hyperplane and Support Vectors in the SVM algorithm RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary for classifying the data points. This best boundary is known as the hyperplane of SVM. ▪ The dimension of the hyperplane depends on the number of features in the dataset: with 2 features (as shown in the image), the hyperplane is a straight line; with 3 features, it is a 2-dimensional plane. ▪ We always create a hyperplane with maximum margin, i.e., the maximum distance to the nearest data points. ▪ The data points or vectors that are closest to the hyperplane and affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors. 200
• 197. How does SVM work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Linear SVM: ▪ The working of the SVM algorithm can be understood through an example. ▪ Suppose we have a dataset with two tags (green and blue), and the dataset has two features, x1 and x2. ▪ We want a classifier that can classify the pair of coordinates (x1, x2) as either green or blue. Consider the image: ▪ Since this is a 2-d space, we can separate these two classes with a straight line. However, there can be multiple lines that separate these classes. Consider the image below: 201
• 198. How does SVM work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. ▪ The SVM algorithm finds the points of both classes closest to the line. These points are called support vectors. ▪ The distance between the support vectors and the hyperplane is called the margin. ▪ The goal of SVM is to maximize this margin; the hyperplane with maximum margin is called the optimal hyperplane. This objective can be written formally, as shown below. 202
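▪ The margin maximization above has a standard formulation (not shown on the slide). For the hard-margin linear SVM with hyperplane w · x + b = 0, the optimization problem is:

\[
\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n
\]

▪ The constraints keep every training point on the correct side of the boundary, and since the margin equals 2/||w||, minimizing ||w|| maximizes the margin.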
• 199. How does SVM work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Non-Linear SVM: ▪ If data is linearly arranged, we can separate it with a straight line, but for non-linear data we cannot draw a single straight line. Consider the image below: ▪ To separate these data points, we need to add one more dimension. ▪ For linear data we used two dimensions, x and y, so for non-linear data we add a third dimension z, calculated as z = x^2 + y^2. 203
• 200. How does SVM work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ By adding the third dimension z = x^2 + y^2, the sample space becomes as in the image below: ▪ SVM will now divide the dataset into classes in the following way. 204
• 201. How does SVM work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Since we are in 3-d space, the decision boundary looks like a plane parallel to the x-axis. ▪ If we convert it back to 2-d space with z = 1, it becomes a circle: 205
• 202. How does SVM work? RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Hence, for this non-linear data we obtain a decision boundary that is a circle of radius 1 (since z = x^2 + y^2 = 1). 206
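▪ A tiny NumPy sketch of the z = x^2 + y^2 mapping above: points on two concentric circles are not linearly separable in (x, y), but after adding the third dimension they are separated by the plane z = 1:
import numpy as np

theta = np.linspace(0, 2 * np.pi, 50)
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]  # class 0, radius 0.5
outer = np.c_[1.5 * np.cos(theta), 1.5 * np.sin(theta)]  # class 1, radius 1.5

def add_z(points):
    x, y = points[:, 0], points[:, 1]
    return np.c_[x, y, x ** 2 + y ** 2]  # the mapping z = x^2 + y^2

print(add_z(inner)[:, 2].max())  # 0.25 -> below the plane z = 1
print(add_z(outer)[:, 2].min())  # 2.25 -> above the plane z = 1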
• 203. Data Pre-processing step RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
# Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# importing datasets
data_set = pd.read_csv('user_data.csv')
# Extracting independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values
# Splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
207
• 204. Data Pre-processing step RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ After executing the above code, the data is pre-processed. The code gives the dataset as: 208
  • 205. Data Pre-processing step RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning The scaled output for the test set will be: 209
• 206. Fitting the SVM classifier to the training set RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Now the training set will be fitted to the SVM classifier. ▪ To create the SVM classifier, we import the SVC class from the sklearn.svm library. ▪ Below is the code for it:
from sklearn.svm import SVC  # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
▪ In the above code we used kernel='linear', since here we are creating an SVM for linearly separable data; it can be changed for non-linear data. We then fitted the classifier to the training dataset (x_train, y_train). 210
• 207. Output RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
Out[8]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
        decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
        kernel='linear', max_iter=-1, probability=False, random_state=0,
        shrinking=True, tol=0.001, verbose=False)
▪ The model performance can be altered by changing the values of ▪ C (regularization factor), ▪ gamma, ▪ kernel. A grid-search sketch for such tuning follows below. 211
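▪ A minimal sketch of such tuning using grid search over C, gamma, and the kernel (reusing the x_train and y_train from above):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],            # regularization factor
    'gamma': ['scale', 0.1, 1],   # kernel coefficient (ignored by the linear kernel)
    'kernel': ['linear', 'rbf'],
}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(x_train, y_train)
print(search.best_params_, search.best_score_)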
• 208. Predicting the test set result RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Now we will predict the output for the test set. For this, we create a new vector y_pred. Below is the code for it:
# Predicting the test set result
y_pred = classifier.predict(x_test)
▪ After getting the y_pred vector, we can compare y_pred and y_test to check the difference between the actual and predicted values. ▪ Output: The image shows the output for the prediction of the test set: 212
• 209. Creating the confusion matrix RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Now we will check the performance of the SVM classifier, i.e., how many incorrect predictions there are. ▪ To create the confusion matrix, we need to import the confusion_matrix function from the sklearn library. ▪ After importing the function, we call it and store the result in a new variable cm. ▪ The function takes two main parameters: y_true (the actual values) and y_pred (the values returned by the classifier).
# Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
213
• 210. Creating the confusion matrix RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ As we can see in the output image, there are 66+24 = 90 correct predictions and 8+2 = 10 incorrect predictions. 214
• 211. Visualizing the training set result RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
215
• 212. Visualizing the training set result RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Output ▪ As we can see, the above output appears similar to the Logistic Regression output. ▪ In the output, we got a straight line as the hyperplane because we used a linear kernel in the classifier. ▪ We also discussed above that for 2-d space, the hyperplane in SVM is a straight line. 216
• 213. Visualizing the test set result RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
217
• 214. Visualizing the test set result RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ As we can see in the above output image, the SVM classifier has divided the users into two regions (Purchased or Not purchased). ▪ Users who purchased the SUV are in the red region with red scatter points. ▪ Users who did not purchase the SUV are in the green region with green scatter points. ▪ The hyperplane has divided the two classes into the Purchased and Not purchased categories. 218
• 216. Semi-Supervised Learning RJEs: Remote job entry points ▪ Closely related to transductive learning ▪ Uses both labeled and unlabeled data to perform an otherwise supervised or unsupervised learning task ▪ Initially motivated by its practical value in learning faster, better, and cheaper ▪ Has applications in cognitive psychology as a computational model for human learning ▪ It typically combines a smaller labeled component with a much larger unlabeled component ▪ Some of the applications are text classification, iterative co-training-based applications such as webpage classification, lane finding on GPS data, etc. Ref: https://pages.cs.wisc.edu/~jerryzhu/pub/SSL_EoML.pdf 220
• 217. Algorithm Flow RJEs: Remote job entry points ▪ Semi-supervised learning algorithm flow Ref: https://www.cs.cmu.edu/~ninamf/courses/401sp18/lectures/ssl-04-18.pdf ▪ Models based on this include semi-supervised SVMs, graph-based models, generative models, etc. A minimal self-training sketch follows below. 221
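▪ As a minimal sketch of this flow, scikit-learn ships a self-training wrapper in which unlabeled samples are marked with the label -1 (the data below is randomly generated purely for illustration, not a real dataset):
import numpy as np
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.random.rand(100, 2)             # hypothetical feature matrix
y = np.full(100, -1)                   # -1 marks the (larger) unlabeled part
y[:20] = np.tile([0, 1], 10)           # a small labeled subset with both classes

base = SVC(kernel='linear', probability=True)  # self-training needs predict_proba
model = SelfTrainingClassifier(base).fit(X, y)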
  • 218. Major Kernel Functions in Support Vector Machine 222
• 219. Major Kernel Functions in Support Vector Machine RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning What is a Kernel Method? ▪ Kernel methods are a set of techniques used in machine learning to address classification, regression, and other prediction problems. They are built around kernels, which are functions that gauge how similar two data points are to one another in a high-dimensional feature space. ▪ The fundamental idea of kernel methods is to convert the input data into a high-dimensional feature space, which makes it simpler to distinguish between classes or generate predictions. Instead of manually computing the feature space, kernel methods employ a kernel function to implicitly map the data into that space. ▪ The most popular kind of kernel method is the Support Vector Machine (SVM), a binary classifier that determines the hyperplane that most effectively divides the two groups. To efficiently locate the ideal hyperplane, SVMs map the input into a higher-dimensional space using a kernel function. ▪ Other examples of kernel methods include kernel ridge regression, kernel PCA, and Gaussian processes. Since they are powerful, adaptable, and computationally efficient, kernel methods are frequently employed in machine learning. They are resilient to noise and outliers and can handle sophisticated data structures such as strings and graphs. 223
  • 220. Kernel Method in SVMs RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Support Vector Machines (SVMs) use kernel methods to transform the input data into a higher- dimensional feature space, which makes it simpler to distinguish between classes or generate predictions. ▪ Kernel approaches in SVMs work on the fundamental principle of implicitly mapping input data into a higher-dimensional feature space without directly computing the coordinates of the data points in that space. ▪ The kernel function in SVMs is essential in determining the decision boundary that divides the various classes. ▪ In order to calculate the degree of similarity between any two points in the feature space, the kernel function computes their dot product. ▪ The most commonly used kernel function in SVMs is the Gaussian or radial basis function (RBF) kernel. The RBF kernel maps the input data into an infinite-dimensional feature space using a Gaussian function. This kernel function is popular because it can capture complex nonlinear relationships in the data. 224
  • 221. Kernel Method in SVMs RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Other types of kernel functions that can be used in SVMs include the polynomial kernel, the sigmoid kernel, and the Laplacian kernel. The choice of kernel function depends on the specific problem and the characteristics of the data. ▪ Basically, kernel methods in SVMs are a powerful technique for solving classification and regression problems, and they are widely used in machine learning because they can handle complex data structures and are robust to noise and outliers. 225
• 222. Characteristics of Kernel Function RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Mercer's condition: A kernel function must satisfy Mercer's condition to be valid. This condition ensures that the kernel (Gram) matrix built from any set of inputs is positive semi-definite. ▪ Positive definiteness: A kernel function is positive definite if it is strictly greater than zero whenever the inputs differ from each other. ▪ Non-negativity: A kernel function is non-negative if it produces non-negative values for all inputs. ▪ Symmetry: A kernel function is symmetric, meaning that it produces the same value regardless of the order in which the inputs are given. 226
• 223. Characteristics of Kernel Function RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Reproducing property: A kernel function satisfies the reproducing property if it can be used to reconstruct the input data in the feature space. ▪ Smoothness: A kernel function is said to be smooth if it produces a smooth transformation of the input data into the feature space. ▪ Complexity: The complexity of a kernel function is an important consideration, as more complex kernel functions may lead to overfitting and reduced generalization performance. 227
• 224. Selecting an appropriate kernel function RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ Basically, the choice of kernel function depends on the specific problem and the characteristics of the data, and selecting an appropriate kernel function can significantly impact the performance of machine learning algorithms. ▪ Major Kernel Functions in Support Vector Machine ▪ In Support Vector Machines (SVMs), there are several types of kernel functions that can be used to map the input data into a higher-dimensional feature space. The choice of kernel function depends on the specific problem and the characteristics of the data. 228
• 225. Linear Kernel RJEs: Remote job entry points https://www.javatpoint.com/ ▪ A linear kernel is a type of kernel function used in machine learning, including in SVMs (Support Vector Machines). It is the simplest and most commonly used kernel function, defined as the dot product of the input vectors in the original feature space. ▪ The linear kernel can be defined as: K(x, y) = x · y ▪ where x and y are the input feature vectors. ▪ The dot product of the input vectors is a measure of their similarity in the original feature space. ▪ When using a linear kernel in an SVM, the decision boundary is a linear hyperplane that separates the different classes in the feature space. ▪ This linear boundary can be useful when the data is already separable by a linear decision boundary, or when dealing with high-dimensional data, where more complex kernel functions may lead to overfitting. 229
• 226. Polynomial Kernel RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ It is a nonlinear kernel function that employs polynomial functions to map the input data into a higher-dimensional feature space. ▪ The polynomial kernel can be defined as: K(x, y) = (x · y + c)^d ▪ where x and y are the input feature vectors, c is a constant term, and d is the degree of the polynomial. ▪ The constant term is added to the dot product of the input vectors, and the result is raised to the degree of the polynomial. ▪ The decision boundary of an SVM with a polynomial kernel is a nonlinear hyperplane that can capture more intricate correlations between the input features. ▪ The degree of the polynomial determines the degree of nonlinearity in the decision boundary. 230
• 227. Polynomial Kernel RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ The polynomial kernel has the benefit of being able to detect both linear and nonlinear correlations in the data. ▪ Selecting the proper degree of the polynomial can be difficult, though: a larger degree can result in overfitting, while a lower degree may not adequately represent the underlying relationships in the data. ▪ In general, the polynomial kernel is an effective tool for converting the input data into a higher-dimensional feature space in order to capture nonlinear correlations between the input features. Gaussian (RBF) Kernel: The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a popular kernel function used in machine learning, particularly in SVMs (Support Vector Machines). It is a nonlinear kernel function that maps the input data into a higher-dimensional feature space using a Gaussian function. The Gaussian kernel can be defined as: K(x, y) = exp(-gamma * ||x - y||^2) 231
• 228. Gaussian (RBF) Kernel RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning K(x, y) = exp(-gamma * ||x - y||^2), where gamma controls the width of the Gaussian. ▪ One advantage of the Gaussian kernel is its ability to capture complex relationships in the data without the need for explicit feature engineering. ▪ However, the choice of the gamma parameter can be challenging, as a smaller value may result in underfitting, while a larger value may result in overfitting. 232
  • 229. Laplace Kernel RJEs: Remote job entry points https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning ▪ The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type of kernel function used in machine learning, including in SVMs (Support Vector Machines). It is a non-parametric kernel that can be used to measure the similarity or distance between two input feature vectors. ▪ The Laplacian kernel can be defined as: K(x, y) = exp(-gamma * ||x - y||) ▪ Where x and y are the input feature vectors, gamma is a parameter that controls the width of the Laplacian function, and ||x - y|| is the L1 norm or Manhattan distance between the input vectors. ▪ When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear hyperplane that can capture complex relationships between the input features. The width of the Laplacian function, controlled by the gamma parameter, determines the degree of nonlinearity in the decision boundary. ▪ One advantage of the Laplacian kernel is its robustness to outliers, as it places less weight on large distances between the input vectors than the Gaussian kernel. However, like the Gaussian kernel, choosing the correct value of the gamma parameter can be challenging. 233
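▪ Minimal NumPy sketches of the four kernel functions described above, for two feature vectors x and y (gamma, c, and d are user-chosen parameters):
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                            # K(x, y) = x . y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c) ** d                 # K(x, y) = (x . y + c)^d

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # squared L2 distance

def laplacian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum(np.abs(x - y)))  # L1 (Manhattan) distance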
  • 231. Reinforcement Learning RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 237 ▪ What is Reinforcement Learning? ▪ Terms used in Reinforcement Learning. ▪ Key features of Reinforcement Learning. ▪ Elements of Reinforcement Learning. ▪ Approaches to implementing Reinforcement Learning. ▪ How does Reinforcement Learning Work? ▪ The Bellman Equation. ▪ Types of Reinforcement Learning. ▪ Reinforcement Learning Algorithm. ▪ Markov Decision Process. ▪ What is Q-Learning? ▪ Difference between Supervised Learning and Reinforcement Learning. ▪ Applications of Reinforcement Learning. ▪ Conclusion.
• 232. Reinforcement Learning Tutorial RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 238 ▪ Reinforcement Learning is a feedback-based Machine Learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty. ▪ In Reinforcement Learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning. ▪ Since there is no labeled data, the agent is bound to learn from its experience only.
• 233. Reinforcement Learning Tutorial RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 239 ▪ RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game-playing, robotics, etc. ▪ The agent interacts with the environment and explores it by itself. ▪ The primary goal of an agent in reinforcement learning is to improve its performance by obtaining the maximum positive rewards.
• 234. Reinforcement Learning Tutorial RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 240 ▪ The agent learns through a process of trial and error, and based on experience it learns to perform the task. Hence, we can say that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." ▪ It is a core part of Artificial Intelligence, and many AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention. ▪ Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions; based on those actions the state of the agent changes, and it also receives a reward or penalty as feedback.
• 235. Reinforcement Learning Tutorial RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 241 ▪ The agent continues doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing so it learns and explores the environment. ▪ The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. As a positive reward the agent gets a positive point, and as a penalty it gets a negative point.
• 236. Terms used in Reinforcement Learning RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 242 ▪ Agent: An entity that can perceive/explore the environment and act upon it. ▪ Environment: The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature. ▪ Action: Actions are the moves taken by an agent within the environment. ▪ State: A situation returned by the environment after each action taken by the agent. ▪ Reward: Feedback returned to the agent from the environment to evaluate the action. ▪ Policy: A strategy applied by the agent for the next action based on the current state. ▪ Value: The expected long-term return with the discount factor, as opposed to the short-term reward. ▪ Q-value: Mostly similar to the value, but it takes one additional parameter, the current action.
• 237. Key Features of Reinforcement Learning RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 243 ▪ In RL, the agent is not instructed about the environment or what actions need to be taken. ▪ It is based on a trial-and-error process. ▪ The agent takes the next action and changes state according to the feedback from the previous action. ▪ The agent may get a delayed reward. ▪ The environment is stochastic, and the agent needs to explore it to obtain the maximum positive rewards.
• 238. Approaches to implement Reinforcement Learning RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 244 ▪ Value-based: The value-based approach aims to find the optimal value function, i.e., the maximum value at a state under any policy. The agent then expects the long-term return at any state(s) under policy π. ▪ Policy-based: The policy-based approach finds the optimal policy for the maximum future rewards without using the value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward. ▪ The policy-based approach has mainly two types of policy: ▪ Deterministic: The same action is produced by the policy (π) at any given state. ▪ Stochastic: Probability determines the action produced by the policy. ▪ Model-based: In the model-based approach, a virtual model of the environment is created, and the agent explores that environment to learn it. There is no particular solution or algorithm for this approach because the model representation differs for each environment.
  • 239. Elements of Reinforcement Learning RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 245 1. Policy 2. Reward Signal 3. Value Function 4. Model of the environment
• 240. Policy RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 246 ▪ A policy defines the way an agent behaves at a given time. ▪ It maps the perceived states of the environment to the actions taken in those states. ▪ The policy is the core element of RL, as it alone can define the behavior of the agent. ▪ In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. ▪ It can be deterministic or stochastic: For a deterministic policy: a = π(s). For a stochastic policy: π(a | s) = P[A_t = a | S_t = s].
  • 241. Reward Signal RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 247 ▪ The goal of reinforcement learning is defined by the reward signal. ▪ At each state, the environment sends an immediate signal to the learning agent, and this signal is known as a reward signal. ▪ These rewards are given according to the good and bad actions taken by the agent. ▪ The agent's main objective is to maximize the total number of rewards for good actions. ▪ The reward signal can change the policy, such as if an action selected by the agent leads to low reward, then the policy may change to select other actions in the future.
  • 242. Value Function RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 248 ▪ The value function gives information about how good the situation and action are and how much reward an agent can expect. ▪ A reward indicates the immediate signal for each good and bad action, whereas a value function specifies the good state and action for the future. ▪ The value function depends on the reward as, without reward, there could be no value. The goal of estimating values is to achieve more rewards.
• 243. Model RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 249 ▪ The model mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave; for example, given a state and an action, the model can predict the next state and reward. ▪ The model is used for planning, which means it provides a way to choose a course of action by considering all future situations before actually experiencing them. Approaches that solve RL problems with the help of a model are termed model-based approaches; an approach without a model is called a model-free approach.
• 244. How does Reinforcement Learning Work? RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 250 ▪ To understand the working process of RL, we need to consider two main things: ▪ Environment: It can be anything such as a room, a maze, a football ground, etc. ▪ Agent: An intelligent agent such as an AI robot. Let's take an example of a maze environment that the agent needs to explore.
• 245. How does Reinforcement Learning Work? RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 251 ▪ In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, S8, a fire pit, and S4, a diamond block. ▪ The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right. ▪ The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. Suppose the agent takes the path S9-S5-S1-S2-S3; then it gets the +1 reward. ▪ The agent will try to remember the preceding steps it took to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step.
• 246. How does Reinforcement Learning Work? RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 252 ▪ Now the agent has successfully stored the previous steps by assigning the value 1 to each previous block. ▪ But what will the agent do if it starts from a block that has blocks of value 1 on both sides? ▪ It would be a difficult decision for the agent whether to go up or down, as each block has the same value. This approach is therefore not suitable for the agent to reach the destination. To solve the problem, we use the Bellman equation, which is the main concept behind reinforcement learning.
• 247. The Bellman Equation RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 253 ▪ The Bellman equation was introduced by the mathematician Richard Ernest Bellman in 1953, and hence it is called the Bellman equation. It is associated with dynamic programming and is used to calculate the values of a decision problem at a certain point by including the values of previous states. ▪ It is a way of calculating value functions in dynamic programming, and it leads to modern reinforcement learning. ▪ The key elements used in the Bellman equation are: ▪ the action performed by the agent, "a"; ▪ the state occurring by performing the action, "s"; ▪ the reward/feedback obtained for each good and bad action, "R"; ▪ the discount factor, Gamma "γ". ▪ The Bellman equation can be written as: V(s) = max_a [R(s, a) + γV(s')]
• 248. The Bellman Equation RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 254 ▪ The Bellman equation can be written as: V(s) = max_a [R(s, a) + γV(s')] where ▪ V(s) = the value calculated at a particular point, ▪ R(s, a) = the reward at state s for performing action a, ▪ γ = the discount factor, ▪ V(s') = the value of the next state. ▪ In the above equation, we take the maximum over actions because the agent always tries to find the optimal solution. ▪ Now, using the Bellman equation, we will find the value at each state of the given environment, starting from the block next to the target block.
• 249. The Bellman Equation RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 255 ▪ For the 1st block: V(s3) = max[R(s, a) + γV(s')], with V(s') = 0 because there is no further state to move to. ▪ V(s3) = max[R(s, a)] => V(s3) = max[1] => V(s3) = 1.
• 250. The Bellman Equation RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 256 ▪ For the 2nd block: V(s2) = max[R(s, a) + γV(s')], with γ = 0.9 (say), V(s') = 1, and R(s, a) = 0 because there is no reward at this state. ▪ V(s2) = max[0.9 × 1] => V(s2) = max[0.9] => V(s2) = 0.9.
• 251. The Bellman Equation RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 257 ▪ For the 3rd block: V(s1) = max[R(s, a) + γV(s')], with γ = 0.9 (say), V(s') = 0.9, and R(s, a) = 0, because there is no reward at this state either. ▪ V(s1) = max[0.9 × 0.9] => V(s1) = max[0.81] => V(s1) = 0.81.
• 252. The Bellman Equation RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 258 ▪ For the 4th block: V(s5) = max[R(s, a) + γV(s')], with γ = 0.9 (say), V(s') = 0.81, and R(s, a) = 0, because there is no reward at this state either. ▪ V(s5) = max[0.9 × 0.81] => V(s5) = max[0.729] => V(s5) ≈ 0.73.
• 253. The Bellman Equation RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 259 ▪ For the 5th block: V(s9) = max[R(s, a) + γV(s')], with γ = 0.9 (say), V(s') = 0.73, and R(s, a) = 0, because there is no reward at this state either. ▪ V(s9) = max[0.9 × 0.73] => V(s9) = max[0.657] => V(s9) ≈ 0.66.
• 254. The Bellman Equation RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 260 ▪ Now the agent has three options to move: if it moves to the blue box it will feel a bump, and if it moves to the fire pit it will get a -1 reward. ▪ Since here we consider only positive rewards, the agent will move upwards only. ▪ The complete block values are calculated using this formula.
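▪ The backward calculation above can be reproduced with a few lines of Python (a sketch of value iteration on the single path S9-S5-S1-S2-S3 used on the slides; the walls, the fire pit, and the max over actions are omitted since each state here has only one successor):
gamma = 0.9
states = ['s9', 's5', 's1', 's2', 's3']      # path used on the slides
next_state = dict(zip(states, states[1:]))   # each state leads to the next
reward = {s: 0.0 for s in states}
reward['s3'] = 1.0                           # +1 for reaching the goal from s3

V = {s: 0.0 for s in states}
for _ in range(10):                          # sweep until the values settle
    for s in states:
        ns = next_state.get(s)
        V[s] = reward[s] + gamma * (V[ns] if ns else 0.0)

print(V)  # approximately {'s9': 0.66, 's5': 0.73, 's1': 0.81, 's2': 0.9, 's3': 1.0}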
  • 255. Types of Reinforcement learning 261
  • 256. Types of Reinforcement learning RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning 262 ▪ There are mainly two types of reinforcement learning, which are: ▪ Positive Reinforcement ▪ Negative Reinforcement
• 257. Positive Reinforcement RJEs: Remote job entry points https://www.javatpoint.com/reinforcement-learning, https://www.verywellmind.com/what-is-positive-reinforcement-2795412 263 ▪ Positive reinforcement learning means adding something to increase the likelihood that the expected behavior will occur again. It impacts the behavior of the agent positively and increases the strength of the behavior. ▪ This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
• 258. Negative Reinforcement ▪ Negative reinforcement is the opposite of positive reinforcement: it increases the tendency that a specific behavior will occur again by avoiding a negative condition. ▪ It can be more effective than positive reinforcement, depending on the situation and behavior, but it provides only enough reinforcement to meet the minimum required behavior. https://www.javatpoint.com/reinforcement-learning, https://www.parentingforbrain.com/negative-reinforcement/
• 260. How to represent the agent state? https://www.javatpoint.com/reinforcement-learning ▪ We can represent the agent state using the Markov state, which contains all the required information from the history. ▪ A state St is a Markov state if it satisfies the condition: P[St+1 | St] = P[St+1 | S1, ..., St] ▪ The Markov state follows the Markov property, which says that the future is independent of the past given the present. ▪ RL works in fully observable environments, where the agent can observe the environment and act in the new state. The complete process is known as a Markov Decision Process (MDP).
• 261. Markov Decision Process https://www.javatpoint.com/reinforcement-learning ▪ The Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. ▪ In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state. ▪ MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.
• 262. Markov Decision Process https://www.javatpoint.com/reinforcement-learning ▪ An MDP is a tuple of four elements (S, A, Pa, Ra): ▪ a finite set of states S ▪ a finite set of actions A ▪ a transition probability Pa(s, s'): the probability of moving from state s to state s' due to action a ▪ a reward Ra(s, s'): the reward received after transitioning from state s to state s' due to action a ▪ An MDP uses the Markov property. A sketch of this structure in code is given below.
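As a sketch, the four-tuple can be written directly in code (a hedged illustration; the toy states, actions, and values are invented for this example):

from typing import Dict, NamedTuple, Set, Tuple

class MDP(NamedTuple):
    states: Set[str]                               # finite set S
    actions: Set[str]                              # finite set A
    transition: Dict[Tuple[str, str, str], float]  # Pa: P(s' | s, a), keyed by (s, a, s')
    reward: Dict[Tuple[str, str, str], float]      # Ra: reward for the transition (s, a, s')

# Toy two-state example.
mdp = MDP(
    states={"s1", "s2"},
    actions={"stay", "move"},
    transition={("s1", "move", "s2"): 1.0, ("s1", "stay", "s1"): 1.0},
    reward={("s1", "move", "s2"): 1.0, ("s1", "stay", "s1"): 0.0},
)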
• 263. Markov Property https://www.javatpoint.com/reinforcement-learning ▪ It says: "If the agent is present in the current state s1, performs an action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on the current state; future actions and states do not depend on past actions, rewards, or states." ▪ In other words, as per the Markov property, the current state transition does not depend on any past action or state. ▪ Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a chess game, the players only focus on the current state and do not need to remember past actions or states.
• 264. Finite MDP https://www.javatpoint.com/reinforcement-learning ▪ A finite MDP is one with finite states, finite rewards, and finite actions. ▪ In RL, we consider only finite MDPs.
• 265. Markov Process https://www.javatpoint.com/reinforcement-learning ▪ A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. ▪ A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P. ▪ These two components (S and P) define the dynamics of the system.
• 267. Q-Learning https://www.javatpoint.com/reinforcement-learning ▪ Q-learning is an off-policy RL algorithm used for temporal difference learning. ▪ Temporal difference learning methods compare temporally successive predictions. ▪ It learns the value function Q(s, a), which measures how good it is to take action "a" at a particular state "s". ▪ A flowchart on the original slide illustrates the working of Q-learning.
• 268. State Action Reward State Action (SARSA) https://www.javatpoint.com/reinforcement-learning ▪ SARSA stands for State Action Reward State Action; it is an on-policy temporal difference learning method. An on-policy control method selects the action for each state while learning, using a specific policy. ▪ The goal of SARSA is to calculate Qπ(s, a) for the selected current policy π and all pairs (s, a). ▪ The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, SARSA does not need the maximum reward of the next state to update the Q-value in the table. ▪ In SARSA, the new action and reward are selected using the same policy that determined the original action. ▪ SARSA is so named because it uses the quintuple Q(s, a, r, s', a'), where s: original state, a: original action, r: observed reward, s' and a': the new state–action pair. The contrast with Q-learning is sketched below.
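The contrast is clearest in the two update rules. A hedged sketch (α is the learning rate; Q is a nested dictionary; all names are illustrative):

from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(lambda: defaultdict(float))   # Q[s][a], initialized to zero

def q_learning_update(s, a, r, s_next, actions):
    # Off-policy: bootstrap with the greedy (max) Q-value in the next state.
    best_next = max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap with the action a_next actually chosen by the
    # same policy in the next state, i.e. the quintuple (s, a, r, s', a').
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])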
• 269. Deep Q Neural Network (DQN) https://www.javatpoint.com/reinforcement-learning ▪ As the name suggests, DQN is Q-learning using neural networks. ▪ For an environment with a big state space, defining and updating a Q-table is a challenging and complex task. ▪ To solve this issue, we can use a DQN algorithm, where, instead of defining a Q-table, a neural network approximates the Q-values for each action and state, as the sketch below suggests.
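A minimal PyTorch sketch of the idea (not from the slides; layer sizes and dimensions are illustrative assumptions):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one Q-value per action, replacing the Q-table.
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.zeros(1, 4))      # Q-values for a dummy state
action = int(q_values.argmax(dim=1))     # greedy action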
• 270. Q-Learning Explanation https://www.javatpoint.com/reinforcement-learning ▪ Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation. ▪ The main objective of Q-learning is to learn a policy that informs the agent what actions should be taken to maximize the reward under what circumstances. ▪ It is an off-policy RL algorithm that attempts to find the best action to take in the current state. ▪ The goal of the agent in Q-learning is to maximize the value of Q. ▪ The value of Q-learning can be derived from the Bellman equation. Consider the Bellman equation (reconstructed from the slide's description): V(s) = max_a [R(s,a) + γ Σs' P(s'|s,a) V(s')]
• 271. Q-Learning Explanation https://www.javatpoint.com/reinforcement-learning ▪ In the equation, we have various components: the reward, the discount factor (γ), the transition probability, and the next state s'. But no Q-value is given yet. ▪ Consider an agent that has three value options, V(s1), V(s2), V(s3). Since this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. The agent will move on a probability basis and change its state. But if we want exact moves, we need to make some changes in terms of the Q-value.
• 272. Q-Learning Explanation https://www.javatpoint.com/reinforcement-learning ▪ Q represents the quality of the actions at each state. ▪ So instead of using a value at each state, we use a pair of state and action, i.e., Q(s, a). ▪ The Q-value specifies which action is better than others, and the agent takes its next move according to the best Q-value. The Bellman equation can be used to derive the Q-value. ▪ To perform any action, the agent will get a reward R(s, a) and will end up in a certain state, so the Q-value equation is: Q(s, a) = R(s,a) + γ Σs' P(s'|s,a) V(s') ▪ Hence, we can say that V(s) = max_a [Q(s, a)]
• 273. Q-Learning Explanation https://www.javatpoint.com/reinforcement-learning ▪ The Q in Q-learning stands for quality: it specifies the quality of an action taken by the agent.
• 274. Q-table https://www.javatpoint.com/reinforcement-learning ▪ A Q-table or matrix is created while performing Q-learning. ▪ The table is indexed by state–action pairs [s, a], with all values initialized to zero. ▪ After each action, the table is updated and the Q-values are stored in it. ▪ The RL agent uses this Q-table as a reference to select the best action based on the Q-values, as the sketch below shows.
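A minimal sketch of such a Q-table and its update, assuming small discrete state and action sets (sizes and names are illustrative):

import numpy as np

n_states, n_actions = 6, 4
alpha, gamma = 0.1, 0.9

Q = np.zeros((n_states, n_actions))   # the Q-table, initialized to zero

def update(s, a, r, s_next):
    # Tabular Q-learning update after observing (s, a, r, s').
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def best_action(s):
    # The agent consults the table and picks the action with the highest Q-value.
    return int(np.argmax(Q[s]))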
• 275. Difference Between Reinforcement Learning and Supervised Learning https://www.javatpoint.com/reinforcement-learning ▪ RL works by interacting with the environment; supervised learning works on an existing dataset. ▪ An RL algorithm works the way the human brain works when making decisions; supervised learning works the way a human learns under the supervision of a guide. ▪ In RL, no labeled dataset is present; in supervised learning, a labeled dataset is present. ▪ In RL, no previous training is provided to the learning agent; in supervised learning, training is provided to the algorithm so that it can predict the output. ▪ RL helps to take decisions sequentially; in supervised learning, a decision is made when an input is given.
• 276. Reinforcement Learning ▪ There are various applications based on the concept of RL. Ref: [1] https://www.javatpoint.com/reinforcement-learning; [2] https://medium.com/@yuxili/rl-applications-73ef685c07eb
• 278. Gaussian Mixture Model (GMM) ▪ k-means exploits only the mean of a cluster or distribution as the representation of class-specific information. ▪ Second-order moments such as the variance also contain class-specific information. ▪ A Gaussian distribution can exploit both the mean and the variance. ▪ For scalar data it is a univariate Gaussian distribution; for vector data it is a multivariate Gaussian distribution.
• 279–281. Univariate vs Multivariate Gaussian Distribution (figures on the original slides)
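The figures on those slides are not reproduced in this transcript; for reference, the standard densities they contrast can be written as:

% Univariate Gaussian: scalar x, mean \mu, variance \sigma^2
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

% Multivariate Gaussian: d-dimensional x, mean vector \mu, covariance matrix \Sigma
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)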
• 282. Clustering using Multivariate Gaussian Distribution
• 283. Gaussian Mixture Model (GMM)
• 284. Expectation-Maximization (EM) Algorithm
• 285. Implementation of EM Algorithm
• 286. Re-estimation in EM Algorithm
• 287. Clustering using GMM
• 288. What is a Gaussian Mixture Model (GMM)? A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of a mixture model as a generalization of the k-means clustering algorithm, as it can also be used for density estimation and classification. In a Gaussian mixture model, each cluster is associated with a multivariate Gaussian distribution, and the mixture model is a weighted sum of these distributions. The weights indicate the probability that a data point belongs to a particular cluster, and the Gaussian distributions describe the distribution of the data within each cluster. The parameters of a Gaussian mixture model can be estimated using the expectation-maximization (EM) algorithm, which alternates between estimating the parameters of the Gaussian distributions and the weights of the mixture model until convergence is reached.
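The example code described on the next slide is not reproduced in this transcript; a minimal scikit-learn sketch consistent with that description (200 samples from two 2D Gaussians with different means; n_components=2; covariance_type='full') might look like:

import numpy as np
from sklearn.mixture import GaussianMixture

# 200 samples drawn from two 2D Gaussian distributions with different means.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=1.0, size=(100, 2)),
])

# Fit a two-component GMM with full covariance matrices (fit runs EM internally).
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)

# Predict the cluster label of each data point.
predictions = gmm.predict(X)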
• 289. Fitting a Gaussian Mixture Model https://www.shiksha.com/online-courses/articles/understanding-gaussian-mixture-models/ The example code above generates a dataset X containing 200 samples drawn from two 2D Gaussian distributions with different means. The Gaussian mixture model is then fit to the data, with n_components=2 indicating that there are two mixture components (i.e., two clusters). The covariance_type parameter specifies the type of covariance matrix to use for the Gaussian distributions; in this example, the covariance_type value is 'full'. Once the model is fit, the predict method can be used to predict the cluster labels for the data points in X. The resulting cluster labels are stored in the predictions array.
• 290. Plotting the GMM Clusters To plot the data and the predicted cluster labels, matplotlib is used; a sketch of such plotting code follows. The output is a scatter plot of the data, with points coloured according to their predicted cluster label.
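Continuing the sketch above (X and predictions as defined there), the plotting code might be:

import matplotlib.pyplot as plt

# Scatter plot of the data, coloured by predicted cluster label.
plt.scatter(X[:, 0], X[:, 1], c=predictions)
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.title('GMM cluster assignments')
plt.show()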
• 291. Real-Life Examples of Gaussian Mixture Models ▪ Gaussian mixture models (GMMs), as stated above, are statistical models that represent the probability distribution of a multi-dimensional continuous variable as a weighted sum of multiple multivariate normal distributions. GMMs are used in a variety of applications, including clustering, density estimation, and anomaly detection. Here are a few examples of how GMMs can be used in real life: ▪ Clustering: GMMs can be used to identify patterns and group similar observations together. For example, a GMM could be used to cluster customers into different segments based on their purchase history and demographic data. ▪ Density estimation: GMMs can be used to estimate the probability density function (PDF) of a given dataset. This can be useful for tasks such as density-based anomaly detection, where GMMs can be used to identify observations that are significantly different from the rest of the data. ▪ Anomaly detection: GMMs can be used to detect anomalous observations in a dataset. For example, a GMM could be trained on normal network traffic data and then used to identify unusual traffic patterns that may indicate an intrusion attempt. ▪ Computer vision: GMMs can be used in computer vision applications to model the appearance of objects in an image. For example, a GMM could be used to model the appearance of different types of vehicles in a traffic surveillance system.
• 292. Advantages of Gaussian Mixture Models ▪ Flexibility: Gaussian mixture models can model a wide range of probability distributions, as they can approximate any distribution that can be represented as a weighted sum of multiple normal distributions. ▪ Robustness: Gaussian mixture models are relatively robust to outliers in the data, as they can accommodate the presence of multiple modes ("peaks") in the distribution. ▪ Speed: Gaussian mixture models are relatively fast to fit to a dataset, especially when using an efficient optimization algorithm such as the expectation-maximization (EM) algorithm. ▪ Handling missing data: Gaussian mixture models can handle missing data by marginalizing out the missing variables, which can be useful in situations where some observations are incomplete. ▪ Interpretability: The parameters of a Gaussian mixture model (i.e., the weights, means, and covariances of the components) have a clear interpretation, which can be useful for understanding the underlying structure of the data.
• 293. Disadvantages of Gaussian Mixture Models ▪ Sensitivity to initialization: Gaussian mixture models can be sensitive to the initial values of the model parameters, especially when there are many components in the mixture. This can sometimes lead to poor convergence to the true maximum likelihood solution. ▪ Assumption of normality: Gaussian mixture models assume that the data are generated from a mixture of normal distributions, which may not always be the case in practice. If the data deviate significantly from normality, GMMs may not be the most appropriate model. ▪ Number of components: Choosing the appropriate number of components in a Gaussian mixture model can be challenging: too many components may overfit the data, while too few may underfit it. ▪ High-dimensional data: Gaussian mixture models can be computationally expensive to fit when working with high-dimensional data, as the number of model parameters grows quadratically with the number of dimensions (for full covariance matrices). ▪ Limited expressive power: Gaussian mixture models can only represent distributions that can be expressed as a weighted sum of normal distributions, so they may not be suitable for modelling more complex distributions.
  • 294. Hidden Markov Model in Machine Learning
• 295. Hidden Markov Model in Machine Learning https://www.javatpoint.com/hidden-markov-model-in-machine-learning ▪ Hidden Markov Models (HMMs) are a type of probabilistic model commonly used in machine learning for tasks such as ▪ Speech recognition ▪ Natural language processing ▪ Bioinformatics ▪ They are a popular choice for modelling sequences of data because they can effectively capture the underlying structure of the data, even when the data is noisy or incomplete.
• 296. What are Hidden Markov Models? https://www.javatpoint.com/hidden-markov-model-in-machine-learning ▪ A Hidden Markov Model (HMM) is a probabilistic model that consists of a sequence of hidden states, each of which generates an observation. The hidden states are usually not directly observable, and the goal of the HMM is to estimate the sequence of hidden states based on a sequence of observations. An HMM is defined by the following components: ▪ A set of N hidden states, S = {s1, s2, ..., sN}. ▪ A set of M observations, O = {o1, o2, ..., oM}. ▪ An initial state probability distribution, π = {π1, π2, ..., πN}, which specifies the probability of starting in each hidden state. ▪ A transition probability matrix, A = [aij], which defines the probability of moving from one hidden state to another. ▪ An emission probability matrix, B = [bjk], which defines the probability of emitting an observation from a given hidden state. ▪ The basic idea behind an HMM is that the hidden states generate the observations, and the observed data is used to estimate the hidden state sequence, for example with the forward-backward algorithm. A sketch of the forward pass is given below.
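A hedged NumPy sketch of the forward pass, which computes the likelihood of an observation sequence under an HMM with the components (π, A, B) defined above (the 2-state, 2-symbol values are invented for illustration):

import numpy as np

pi = np.array([0.6, 0.4])      # initial state distribution
A = np.array([[0.7, 0.3],      # A[i, j] = P(next state j | state i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # B[j, k] = P(observation k | state j)
              [0.2, 0.8]])

def forward(obs):
    # alpha[j] = P(o_1, ..., o_t, state_t = j), updated left to right.
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()         # likelihood of the whole sequence

print(forward([0, 1, 0]))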
• 297. Applications of Hidden Markov Models https://www.javatpoint.com/hidden-markov-model-in-machine-learning ▪ Speech Recognition One of the most well-known applications of HMMs is speech recognition. In this field, HMMs are used to model the different sounds and phonemes that make up speech. The hidden states, in this case, correspond to the different sounds or phonemes, and the observations are the acoustic signals generated by the speech. The goal is to estimate the hidden state sequence, which corresponds to the transcription of the speech, based on the observed acoustic signals. HMMs are particularly well-suited for speech recognition because they can effectively capture the underlying structure of the speech, even when the data is noisy or incomplete. In speech recognition systems, the HMMs are usually trained on large datasets of speech signals, and the estimated parameters of the HMMs are used to transcribe speech in real time. ▪ Natural Language Processing Another important application of HMMs is natural language processing. In this field, HMMs are used for tasks such as part-of-speech tagging, named entity recognition, and text classification. In these applications, the hidden states are typically associated with the underlying grammar or structure of the text, while the observations are the words in the text. The goal is to estimate the hidden state sequence, which corresponds to the structure or meaning of the text, based on the observed words. HMMs are useful in natural language processing because they can effectively capture the underlying structure of the text, even when the data is noisy or ambiguous. In natural language processing systems, the HMMs are usually trained on large datasets of text, and the estimated parameters of the HMMs are used to perform various NLP tasks, such as text classification, part-of-speech tagging, and named entity recognition.
• 298. Applications of Hidden Markov Models https://www.javatpoint.com/hidden-markov-model-in-machine-learning ▪ Bioinformatics HMMs are also widely used in bioinformatics, where they model sequences of DNA, RNA, and proteins. The hidden states, in this case, correspond to the different types of residues, while the observations are the sequences of residues. The goal is to estimate the hidden state sequence, which corresponds to the underlying structure of the molecule, based on the observed sequences of residues. HMMs are useful in bioinformatics because they can effectively capture the underlying structure of the molecule, even when the data is noisy or incomplete. In bioinformatics systems, the HMMs are usually trained on large datasets of molecular sequences, and the estimated parameters of the HMMs are used to predict the structure or function of new molecular sequences. ▪ Finance Finally, HMMs have also been used in finance, where they model stock prices, interest rates, and currency exchange rates. In these applications, the hidden states correspond to different economic states, such as bull and bear markets, while the observations are the stock prices, interest rates, or exchange rates. The goal is to estimate the hidden state sequence, which corresponds to the underlying economic state, based on the observed prices, rates, or exchange rates. HMMs are useful in finance because they can effectively capture the underlying economic state, even when the data is noisy or incomplete. In finance systems, the HMMs are usually trained on large datasets of financial data, and the estimated parameters of the HMMs are used to make predictions about future market trends or to develop investment strategies.
• 299. Limitations of Hidden Markov Models https://www.javatpoint.com/hidden-markov-model-in-machine-learning ▪ Limited Modelling Capabilities One of the key limitations of HMMs is that they are relatively limited in their modelling capabilities. HMMs are designed to model sequences of data, where the underlying structure of the data is represented by a set of hidden states. However, the structure of the data can be quite complex, and the simple structure of HMMs may not be enough to capture all the details accurately. For example, in speech recognition, the complex relationship between the speech sounds and the corresponding acoustic signals may not be fully captured by the simple structure of an HMM. ▪ Overfitting Another limitation of HMMs is that they can be prone to overfitting, especially when the number of hidden states is large or the amount of training data is limited. Overfitting occurs when the model fits the training data too well and is unable to generalize to new data. This can lead to poor performance when the model is applied to real-world data and can result in high error rates. To avoid overfitting, it is important to choose the number of hidden states carefully and to use appropriate regularization techniques.
• 300. Limitations of Hidden Markov Models https://www.javatpoint.com/hidden-markov-model-in-machine-learning ▪ Lack of Robustness HMMs are also limited in their robustness to noise and variability in the data. For example, in speech recognition, the acoustic signals generated by speech can be subject to a variety of distortions and noise, which can make it difficult for the HMM to estimate the underlying structure of the data accurately. In some cases, these distortions and noise can cause the HMM to make incorrect decisions, which results in poor performance. To address these limitations, it is often necessary to use additional processing and filtering techniques, such as noise reduction and normalization, to pre-process the data before it is fed into the HMM. ▪ Computational Complexity Finally, HMMs can be limited by their computational complexity, especially when dealing with large amounts of data or when using complex models. The computational complexity of HMMs is due to the need to estimate the parameters of the model and to compute the likelihood of the data given the model. This can be time-consuming and computationally expensive, especially for large models or for data sampled at a high frequency. To address this limitation, it is often necessary to use parallel computing techniques or approximations that reduce the computational complexity of the model.
• 301. Naïve Bayes Classifier ▪ Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem [1] ▪ They are mainly used in text classification, which involves a high-dimensional training dataset [2] ▪ It is a probabilistic classifier, which means it predicts on the basis of the probability of an object ▪ Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. Another strong assumption it makes is that all features are equal, i.e., given the same weight/importance. ▪ Bayes: It is based on Bayes' Theorem. Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Mathematically: P(A|B) = P(B|A) · P(A) / P(B) where P(A|B) is the posterior probability, P(A) is the prior probability, P(B|A) is the likelihood, and P(B) is the marginal probability. Ref: [1] https://www.geeksforgeeks.org/naive-bayes-classifiers/?ref=leftbar-rightbar [2] https://www.javatpoint.com/machine-learning-naive-bayes-classifier
• 302. Naïve Bayes Classifier Ref: [1] https://www.geeksforgeeks.org/naive-bayes-classifiers/?ref=leftbar-rightbar [2] https://www.tutorialspoint.com/machine_learning_with_python/classification_algorithms_naive_bayes.htm, [3] https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/ ▪ There are primarily three types of Naïve Bayes classifiers: ▪ Gaussian Naïve Bayes – continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution; the likelihood of the features is assumed to be Gaussian [1] ▪ Multinomial Naïve Bayes – features are assumed to be drawn from a simple multinomial distribution; this kind of Naïve Bayes is most appropriate for features that represent discrete counts [2] ▪ Bernoulli Naïve Bayes – features are assumed to be binary (0s and 1s); text classification with the 'bag of words' model can be an application of Bernoulli Naïve Bayes ▪ The figure on the original slide shows an example of a Naïve Bayes classifier estimating the probability of play or no play based on likelihood estimation [3]. A minimal sketch of the Gaussian variant is given below.
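A minimal scikit-learn sketch of Gaussian Naïve Bayes (the toy data and values are invented for illustration):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy continuous features with two classes.
X = np.array([[1.0, 2.1], [1.2, 1.9], [4.8, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)
print(clf.predict([[1.1, 2.0], [5.0, 5.0]]))   # predicted classes
print(clf.predict_proba([[1.1, 2.0]]))         # posterior P(class | x)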
• 303. Naïve Bayes Classifier ▪ Advantages of the Naïve Bayes classifier ▪ A fast and easy ML algorithm to predict the class of a dataset ▪ It can be used for binary as well as multi-class classification ▪ It is the most popular choice for text classification problems ▪ Disadvantage of the Naïve Bayes classifier ▪ Naive Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features ▪ Applications of the Naïve Bayes classifier ▪ It is used for credit scoring ▪ It is used in medical data classification ▪ It can be used for real-time predictions because the Naïve Bayes classifier is an eager learner ▪ It is used in text classification such as spam filtering and sentiment analysis Ref: https://www.javatpoint.com/machine-learning-naive-bayes-classifier
• 304. Ensemble Classifiers ▪ Ensemble learning helps improve machine learning results by combining several models [1] ▪ It gives better predictive performance compared to a single model ▪ Ensembles overcome three problems: ▪ Statistical problems: when the hypothesis space is too large for the amount of available data ▪ Computational problems: when the learning algorithm cannot guarantee finding the best hypothesis ▪ Representational problems: when the hypothesis space does not contain any good approximation of the target class(es) ▪ The main challenge with ensemble methods is to obtain base models that make different kinds of errors ▪ The three main classes of ensemble learning methods are bagging, stacking, and boosting [2] ▪ Bagging involves fitting many decision trees on different samples of the same dataset and averaging the predictions ▪ Stacking involves fitting many different model types on the same data and using another model to learn how best to combine the predictions; a sketch of both is given below Ref: [1] https://www.geeksforgeeks.org/ensemble-classifier-data-mining/ [2] https://machinelearningmastery.com/tour-of-ensemble-learning-algorithms/
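A hedged scikit-learn sketch of bagging and stacking (estimator choices and parameters are illustrative; parameter names follow recent scikit-learn versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Bagging: many decision trees fit on bootstrap samples; predictions are averaged.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50)
bagging.fit(X, y)

# Stacking: different model types combined by a meta-learner.
stacking = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier()), ('svm', SVC())],
    final_estimator=LogisticRegression(),
)
stacking.fit(X, y)
print(bagging.score(X, y), stacking.score(X, y))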