UNIT III
SUPERVISED LEARNING
Presented by
Jeyalakshmi.P.,M.E(Ph.D)
AP/ECE
NPR College of Engineering and Technology
Natham
MACHINE LEARNING
“The field of study that gives computers the ability to learn without being explicitly programmed.”
Relation between AI and ML
AI focuses on non-biological systems that exhibit human-like behaviour.
ML algorithms learn models, trends, patterns and rules automatically from the underlying data
representations.
DL is an emerging technology in which algorithms parametrize multilayer neural networks to learn
the data representation.
1. Traditional Programming vs Machine Learning
a) Traditional Programming b) Machine Learning
Types of Machine Learning
Supervised algorithms
 These are also known as predictive algorithms, and they are used to classify or predict
the outcomes of new input with prior knowledge acquired from previous inputs.
 The learning process in supervised algorithms is guided by the inputs and their
corresponding outputs.
 The mapping between the inputs and their outputs is the key for the algorithm to learn.
 The final ML model is developed after iterative learning of the training examples.
 The supervised algorithms can be further classified into:
Regression algorithms: The output variable of these algorithms is a
continuous or real value. For instance, the prediction of the next day's
temperature is done using a regression algorithm.
Classification algorithms: The output variable of these algorithms is a
class or category. For instance, the prediction of the next day's overall
weather is done using a classification algorithm, since the outcome is
one among {sunny, overcast, rainy, cloudy}.
Examples: linear regression, Support Vector Machines (SVM), regression
trees, logistic regression, etc.
Unsupervised algorithms
 Learning in unsupervised algorithms happens without labelled data; that is, the training examples are
presented to the algorithm without targets or outputs.
The algorithm learns the underlying patterns and similarities in the input data to discover hidden
knowledge. This process is referred to as knowledge discovery.
The unsupervised algorithms are further categorized as:
Clustering algorithms: The knowledge discovery in these types of algorithms happens by
uncovering the inherent similarities among the training data.
Association: algorithms that extract rules which can describe large classes
of data.
Examples: K-means clustering, fuzzy c-means clustering, Apriori association rules, etc.
Semi-Supervised algorithms
The learning process in semi-supervised algorithms happens partially on labelled data.
The model is trained on a relatively small amount of labelled data and a larger quantity of
unlabelled data.
The cost incurred to procure labelled data is the motivation for the genesis of semi-supervised
algorithms.
The working of semi-supervised algorithms can be realized in two phases: typically, a model is first
trained on the labelled data; it is then used to assign pseudo-labels to the unlabelled data and
retrained on the combined set.
Reinforcement learning algorithms
 The learning in reinforcement learning takes place by making software agents learn an ideal
behaviour in the given learning environment so as to attain maximum performance.
 The agents are iteratively rewarded through a reinforcement feedback signal. This signal is the central
factor guiding the agent to adapt to or learn the environment and decide on the next step.
 The outcome of learning in these algorithms is an optimal policy that maximises the performance of
the agent.
 Reinforcement learning is commonly used in the field of robotics.
 Examples: adversarial networks and Q-learning.
LINEAR REGRESSION MODELS
Linear regression shows a linear relationship between a dependent variable (y) and one or more independent
variables (x).
Linear Regression is a supervised, statistical ML method used for predictive analysis. It uses the relationship
between the data points to draw a straight line through them.
The mathematical equation of simple linear regression is Y = a + bX, where a is the intercept and b is the slope.
A least-squares regression method is a form of statistical regression analysis that establishes the
relationship between the dependent variable (Y) and the independent variable (X) through a straight line,
referred to as the line of best fit.
Least Squares Regression
 Least squares regression is used to predict the behavior of dependent variables.
 This method of regression analysis begins with a set of data points to be plotted on an x- and y-axis graph.
 If the data shows a linear relationship between two variables, the line that best fits this relationship is
known as the least-squares regression line; it minimizes the vertical distance from the data points to the
regression line.
 The term “least squares” refers to the smallest sum of squared errors, also known as the residual variance.
 The least-squares method is often applied in data fitting.
 The best-fit result minimizes the sum of squared errors or residuals, i.e., the differences between the
observed or experimental values and the corresponding fitted values given by the model.
 There are two basic categories of least-squares problems:
 Ordinary or linear least squares: used in statistical regression analysis
 Nonlinear least squares: an iterative method that approximates the model by a linear one at
each iteration.
 The fit is obtained by minimizing the residuals, or offsets, of each data point
from the line.
Advantages
 The least-squares method of regression analysis is best suited for prediction models and trend analysis.
 It is best used in the fields of economics, finance, and stock markets wherein the value of any future
variable is predicted with the help of existing variables and the relationship between the same.
 The least-squares method provides the closest relationship between the variables.
 The sum of squared residuals from the line of best fit is minimal under this
method.
 The computation mechanism is simple and easy to apply.
Disadvantages
 This method relies on establishing the closest relationship between a given set of variables.
 The computation mechanism is sensitive to the data; in the presence of outliers, the results may
be severely affected.
 More exhaustive computation mechanisms are required for nonlinear problems.
Least Squares Algorithm
• For each (x, y) point calculate x² and xy
• Sum all x, y, x² and xy, which gives Σx, Σy, Σx² and Σxy
• Calculate slope b:
b = (Σxy − (Σx·Σy)/n) / (Σx² − (Σx)²/n)
where n is the number of points.
• Calculate intercept a:
a = (Σy − b·Σx)/n
• Assemble the equation of the line: Y = bx + a
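A minimal Python sketch of this procedure follows; the function name and structure are illustrative, not part of the original slides, and the data reuses the rainfall example worked out below.

```python
# Least-squares fit of a line Y = b*x + a, following the steps above.

def least_squares_fit(points):
    """Return slope b and intercept a of the least-squares line."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)     # Sigma x^2
    sum_xy = sum(x * y for x, y in points)     # Sigma xy
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x * sum_x / n)
    a = (sum_y - b * sum_x) / n
    return b, a

# Rainfall example from the next slide: (hours of rain, fries sold)
data = [(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)]
b, a = least_squares_fit(data)
print(b, a)          # ~1.5182, ~0.3049
print(b * 8 + a)     # prediction for 8 hours of rain, ~12.45
```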
The table below gives the number of hours of rainfall in Chennai and the number of French fries sold
in a canteen during a week, from Monday to Friday. Predict the number of French fries to be
prepared on Saturday if 8 hours of rainfall is expected.
Hours of Rain (x) No. of French Fries sold (y)
2 4
3 5
5 7
7 10
9 15
x y x² xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135
Σx = 26 Σy = 41 Σx² = 168 Σxy = 263
Solution:
• For each (x, y) point calculate x² and xy.
• Find Σx, Σy, Σx² and Σxy.
1. Find the slope (b):
b = (263 − (26×41)/5) / (168 − (26×26)/5) = 1.5182
2. Calculate the intercept (a):
a = (41 − (1.5182×26))/5 = 0.3049
3. Form the equation: Y = 1.5182x + 0.3049
Computing the Error
x y Y= 1.5182x+0.3049 Error (Y-y)
2 4 3.3413 -0.6587
3 5 4.8595 -0.1405
5 7 7.8959 0.8959
7 10 10.9323 0.9323
9 15 13.9687 -1.0313
Visualizing the Line of fit:
Number of French fries to be prepared if it rains for 8 hours:
Substitute x = 8 in Y = 1.5182x + 0.3049, giving Y = 12.45.
So, approximately 13 French fries should be prepared on Saturday.
Single and Multiple Variables
 Simple or single linear regression performs regression analysis of two variables. The
single independent variable impacts the slope of the regression line.
 Multiple regression is a broader class of regressions that encompasses linear and
nonlinear regressions with multiple explanatory variables.
 Each independent variable in multiple regression has its own coefficient to ensure each
variable is weighted appropriately to establish complex connections between variables.
 Two main operations are done in multiple variable regression:
i) Determine the dependent variable based on multiple independent variables.
ii) Determine the strength of the relationship between the dependent variable and each independent variable.
 Multiple regression assumes there is no strong relationship among the independent
variables themselves.
 It also assumes there is a correlation between each independent variable and the single
dependent variable.
 Each of these relationships is weighted to ensure more impactful independent variables drive
the dependent value by adding a unique regression coefficient to each independent variable.
 Using multiple variables for regression gives a more specific calculation than simple linear
regression. More complex relationships can be captured through multiple linear regression.
 All the variables together use multiple slopes to predict the outcome of a single target variable:
Y = a + b₁x₁ + b₂x₂ + … + bₙxₙ
 In the above equation, b₁, b₂, …, bₙ are the slopes for the individual variables x₁, x₂, …, xₙ.
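As a hedged illustration, the same idea can be realized with NumPy's least-squares solver over several explanatory variables; the feature values below are invented for demonstration.

```python
# Multiple linear regression Y = a + b1*x1 + b2*x2 via least squares.
import numpy as np

# Each row: [x1, x2] (e.g., temperature, humidity); y is the target.
X = np.array([[20.0, 60.0], [25.0, 55.0], [30.0, 40.0], [22.0, 70.0]])
y = np.array([3.0, 4.2, 5.1, 3.4])

# Prepend a column of ones so the intercept a is learned as a coefficient.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve the least-squares problem for w = [a, b1, b2].
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
a, b1, b2 = w
print(a, b1, b2)
print(a + b1 * 24 + b2 * 50)   # prediction for a new point
```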
Difference between single and multiple variable regression:
Single variable regression: one dependent variable Y is predicted from a single
explanatory variable x; there is a single regression coefficient (slope).
Example: predicting BMI from age.
Multiple variable regression: one dependent variable Y is predicted from multiple
explanatory variables; there are multiple regression coefficients.
Example: predicting BMI from age, height, gender, etc.
Simple Linear Regression: Only
temperature is used to predict the
weather
Multivariate linear regression:
Three variables are used to predict the weather
Bayesian Regression
Unlike least-squares regression, Bayesian regression uses probability
distributions rather than point estimates. This method is useful when the available
data is scarce.
Bayesian linear regression pushes the idea of the parameter prior a step
further: it does not even attempt to compute a point estimate of the
parameters; instead, the full posterior distribution over the parameters is
taken into account when making predictions.
This allows priors to be placed on the coefficients and on the noise, so
that in the absence of data the priors can take over the regression process.
Also, Bayesian linear regression can tell which parts of its fit the data support with high
confidence and where it is very uncertain.
In Bayesian regression, the response y is not estimated as a single value but
is assumed to be drawn from a probability distribution. The goal of Bayesian
linear regression is to find the posterior over the parameters rather than a single
set of model parameters:
Posterior = (Likelihood × Prior) / Normalization
Likelihood describes the probability of the target values given the data and
parameters.
Prior describes the initial knowledge about which parameter values are
likely and unlikely.
Evidence or normalisation describes the marginal probability of the data and targets; it normalizes the posterior.
Steps to perform Bayesian Regression:
Bayesian regression gives a probabilistic outlook to simple linear regression. It assumes that the data
points follow a normal distribution with zero mean and some known variance (σ²).
Bayesian linear regression estimates distributions over the parameters and predictions. This allows the
uncertainty in the predictions to be modelled.
 Set up a probabilistic model that describes the assumptions about how the data and parameters are
generated.
 Compute inference for the parameters: compute the posterior probability distribution over the parameters.
 With this posterior, perform inference for new, unseen inputs. This step does not
compute point estimates of the outputs; instead, it computes the parameters of the
posterior distribution over the outputs.
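As an illustrative sketch (assuming scikit-learn is available), BayesianRidge implements this style of Bayesian linear regression and can report predictive uncertainty alongside the mean; the toy data below is invented.

```python
# Bayesian linear regression with priors over coefficients and noise.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 1.5 * X.ravel() + 0.3 + rng.normal(0, 1.0, size=30)

model = BayesianRidge()
model.fit(X, y)

# return_std=True yields the predictive standard deviation, i.e. the
# model's uncertainty for each input, rather than a bare point estimate.
mean, std = model.predict([[8.0]], return_std=True)
print(mean, std)
```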
Advantages:
 This method can retrieve the complete range of inferential solutions instead of a
point estimate.
 It works efficiently with small datasets.
 It is very suitable for online learning, where data arrive one at a time, as opposed
to batch learning, where the whole dataset is available up front.
 It is a very powerful and tested approach.
Disadvantages:
 It does not work efficiently if the dataset contains a huge amount of data.
 Inference on the model can be time-consuming.
Gradient Descent
Gradient Descent is an optimization algorithm used to train machine learning models by minimizing the
error between the actual and predicted results.
Steps to perform Gradient Descent:
 Initialize the coefficients of the function. These could be 0 or small random values.
So, set coefficients = 0.
 Evaluate the cost by plugging the coefficients into the function, i.e., cost =
f(coefficients).
 Calculate the derivative of the cost function, to know the slope and hence the direction (sign)
in which to move the coefficient values so as to get a lower cost on the next iteration. So,
calculate change = derivative(cost).
 The downhill direction is now known from the derivative; update the coefficient values
accordingly. Specify a learning rate that controls how much the coefficients change on each
update. So, coefficient = coefficient − (learning_rate × change).
 Repeat this process until the cost is 0 or close to 0.
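A runnable sketch of these steps for a one-variable linear model, reusing the rainfall data from the least-squares example; the learning rate and iteration count are illustrative.

```python
# Gradient descent for y ~ b*x + a, minimizing mean squared error.
data = [(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)]
a, b = 0.0, 0.0                 # initialize coefficients to 0
lr = 0.01                       # learning rate
for _ in range(20000):
    # cost = mean((b*x + a - y)^2); compute its derivatives w.r.t. a and b
    grad_a = sum(2 * (b * x + a - y) for x, y in data) / len(data)
    grad_b = sum(2 * (b * x + a - y) * x for x, y in data) / len(data)
    a -= lr * grad_a            # move downhill
    b -= lr * grad_b
print(b, a)                     # approaches the least-squares solution
```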
Types of Gradient Descent
1. Batch gradient descent
2. Stochastic gradient descent (SGD)
3. Mini-batch gradient descent
CLASSIFICATION MODELS
The classification model is a Supervised Learning technique that identifies the
category of new observations on the basis of training data. Classification
predictive modeling involves assigning a class label to input examples.
Types of classification:
Binary classification refers to predicting one of two classes; multi-class
classification involves predicting one of more than two classes.
Example: email spam detection (spam or not).
Multi-label classification involves predicting one or more classes for each example;
imbalanced classification refers to classification tasks where the distribution of
examples across the classes is not equal.
Example: optical character recognition.
Discriminant Functions
A linear discriminant function is a linear combination of the components of x and can be
written as g(x) = wᵀx + w₀, where w is the weight vector and w₀ is the bias or threshold
weight.
The problem of finding a linear discriminant function can be formulated as a
problem of minimizing a criterion function.
Types of discriminant functions:
1. Two-category case
2. Multi-category case
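A small illustrative sketch of the two-category case: classify by the sign of g(x) = wᵀx + w₀. The weight values below are invented for demonstration.

```python
# Two-category linear discriminant: decide class 1 if g(x) > 0, else class 2.
import numpy as np

w = np.array([1.0, -2.0])   # weight vector (illustrative)
w0 = 0.5                    # bias / threshold weight

def g(x):
    return w @ x + w0       # g(x) = w^T x + w0

x = np.array([3.0, 1.0])
label = 1 if g(x) > 0 else 2
print(g(x), label)          # 1.5 -> class 1
```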
PROBABILISTIC DISCRIMINANT FUNCTIONS
Probabilistic Linear Discriminant Analysis (PLDA) is a probabilistic version of linear discriminant
functions with the ability to handle more complex data.
It is widely used for recognition, similarity checking, feature extraction and verification
processes.
Advantages of PLDA
 Generate a class center using continuous non-linear functions, even from a single
example of an unseen class.
 Compare two examples from previously unseen class(es) to determine whether
they belong to the same class.
 Perform clustering of samples from unseen classes.
Find the class for the regression equation derived in Example 1 for x = 50. (Regression equation
from Example 1: Y = 1.5182x + 0.3049.)
Y = 1.5182 × 50 + 0.3049 = 76.2149
S = e^Y = 1.258
P = S/(1 + S) = 0.386
Using the threshold value of 0.5, the given example is assigned to class 0, since the
predicted probability of class 1 is only 38%.
PROBABILISTIC GENERATIVE MODEL
Generative models or latent variable models or causal models are a way of modeling how a
set of observed data could have arisen from a set of underlying causes.
Naïve Bayes
The Naive Bayes classification algorithm is a probabilistic classifier based on probability models that
incorporate strong independence assumptions. These independence assumptions rarely hold in
reality, hence the classifier is considered naive.
Multinomial Naive Bayes: the features/predictors used by the classifier label the
instances in discrete categories.
Bernoulli Naive Bayes: the predictors are boolean variables, i.e., parameters that take
only yes or no values.
Gaussian Naive Bayes: when the predictors take continuous rather than discrete values,
they are assumed to follow a Gaussian distribution.
Advantages of Naïve Bayes Classifier:
 A fast and easy ML algorithm for predicting the class of a dataset.
 It can be used for binary as well as multi-class classification.
 It performs well in multi-class predictions.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Example 3
Given the training data in the table below, classify the following example using Naïve
Bayes: age<=30, income=medium, student=yes, credit_rating=fair.
RID Age Income Student Credit_rating Class: buys_computer
1 <=30 High No Fair No
2 <=30 High No Excellent No
3 31 to 40 High No Fair Yes
4 >40 Medium No Fair Yes
5 >40 Low Yes Fair Yes
6 >40 Low Yes Excellent No
7 31 to 40 Low Yes Excellent Yes
8 <=30 Medium No Fair No
9 <=30 Low Yes Fair Yes
10 >40 Medium Yes Fair Yes
11 <=30 Medium Yes Excellent Yes
12 31 to 40 Medium No Excellent Yes
13 31 to 40 High Yes Fair Yes
14 >40 Medium No Excellent No
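A worked sketch of the computation follows; the class counts come straight from the table above, while the Python encoding itself is illustrative.

```python
# Naive Bayes by hand for the query (age<=30, income=medium,
# student=yes, credit_rating=fair), using the 14 training rows above.
rows = [
    ("<=30","High","No","Fair","No"), ("<=30","High","No","Excellent","No"),
    ("31-40","High","No","Fair","Yes"), (">40","Medium","No","Fair","Yes"),
    (">40","Low","Yes","Fair","Yes"), (">40","Low","Yes","Excellent","No"),
    ("31-40","Low","Yes","Excellent","Yes"), ("<=30","Medium","No","Fair","No"),
    ("<=30","Low","Yes","Fair","Yes"), (">40","Medium","Yes","Fair","Yes"),
    ("<=30","Medium","Yes","Excellent","Yes"), ("31-40","Medium","No","Excellent","Yes"),
    ("31-40","High","Yes","Fair","Yes"), (">40","Medium","No","Excellent","No"),
]
query = ("<=30", "Medium", "Yes", "Fair")

for c in ("Yes", "No"):
    subset = [r for r in rows if r[4] == c]
    p = len(subset) / len(rows)                  # prior P(class = c)
    for i, v in enumerate(query):                # times P(feature_i = v | c)
        p *= sum(1 for r in subset if r[i] == v) / len(subset)
    print(c, round(p, 4))
# Yes: 9/14 * 2/9 * 4/9 * 6/9 * 6/9 ~ 0.0282
# No:  5/14 * 3/5 * 2/5 * 1/5 * 2/5 ~ 0.0069  -> predict buys_computer = Yes
```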
MAXIMUM MARGIN CLASSIFIER
The maximal margin classifier is the optimal separating hyperplane defined on two linearly separable classes. Given an n×p
data matrix X with a binary response variable y ∈ {−1, 1}, it may be possible to define a (p−1)-dimensional
hyperplane that separates the two classes.
Support Vector Machines
The Support Vector Classifier is an extension of the Maximal Margin Classifier that is less
sensitive to individual data points. It allows a few data points to be misclassified, and is hence
called a Soft Margin Classifier.
The main goal of SVM is to find a hyperplane in an N-dimensional space, where N is the
number of features, that distinctly classifies the data points.
 The SVM classifier constructs a hyperplane in an N-dimensional space that divides the
data points belonging to different classes.
 This hyperplane is chosen based on margin: the hyperplane providing the
maximum margin between the two classes is selected. These margins are calculated
using particular data points, called Support Vectors.
 Support vectors lie nearest to the hyperplane and help in orienting it.
 Multiple hyperplanes exist that separate the two classes of data points.
 The objective is to find the plane with maximum margin; in other words, the distance
between the data points of both classes and the hyperplane should be maximum.
 Data points falling on either side of the hyperplane can be attributed to different classes.
 The dimension of the hyperplane depends upon the number of features. If the number of
input features is 2, then the hyperplane is just a line. If the number of input features is 3,
then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
 Support vectors are the data points closest to the hyperplane; they determine its
position and orientation.
Using these support vectors, the margin of the classifier is maximized.
Steps in the SVM classifier:
1. The SVM algorithm predicts the classes. One of the classes is identified as 1 (the positive class) while
the other is identified as −1 (the negative class).
2. The hinge loss function is used to find the maximum margin (θ is the parameter vector):
loss = max(0, 1 − y · θᵀx)
3. When all classes are correctly predicted, the cost function is 0. The problem with
SVM is that there is a trade-off between maximizing the margin and the loss generated if the
margin is maximized to a very large extent. A regularization parameter is therefore also added to
the loss function.
4. Weights are optimized by calculating the gradients. The gradients are updated using only
the regularization parameter when there is no classification error, while the
loss term is also used when misclassification happens.
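An illustrative sketch of these steps: a linear SVM trained by subgradient descent on the regularized hinge loss. The data, learning rate and regularization strength are invented for demonstration.

```python
# Linear SVM via subgradient descent on hinge loss + L2 regularization.
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 1.5], [6.0, 5.0], [7.0, 8.0]])
y = np.array([-1, -1, 1, 1])          # classes encoded as -1 / +1

w = np.zeros(2); b = 0.0
lr, lam = 0.01, 0.01                  # learning rate, regularization strength
for _ in range(1000):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) >= 1:
            w -= lr * (2 * lam * w)             # correct side: regularize only
        else:
            w -= lr * (2 * lam * w - yi * xi)   # inside margin: hinge term too
            b += lr * yi
print(w, b, np.sign(X @ w + b))       # should recover [-1, -1, 1, 1]
```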
Advantages of Support Vector Machine:
 The SVM can ignore outliers and find the hyper-plane that has the maximum margin.
 SVM works well when there is a clear margin of separation between classes.
 It is more effective in high dimensional spaces.
 SVM is memory efficient.
Disadvantages of Support Vector Machine:
 SVM algorithm is not suitable for large data sets.
 SVM does not perform very well when the data set has more noise i.e. target classes are overlapping.
 As the support vector classifier works by placing data points above and below the classifying
hyperplane, there is no direct probabilistic explanation for the classification.
Kernel based SVM
Classifying the data points using hyperplanes is not always possible, as the data points may not be
linearly separable in 2D, or no hyperplane may exist to separate them in 3D. In these cases, kernel-based
SVM is used.
Using kernels:
 Map the lower-dimensional data points, via a mapping function, to a higher dimension where
they become separable.
 Fit a line or hyperplane, as required, to separate those points.
 Project the resulting boundary back to the lower dimensions.
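A hedged illustration (assuming scikit-learn is available): an RBF-kernel SVC separates two concentric rings that no straight line can; the dataset parameters are illustrative.

```python
# Kernel SVM on data that is not linearly separable in 2D.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)
print(linear.score(X, y))   # poor: the classes are not linearly separable
print(rbf.score(X, y))      # near 1.0: the kernel maps to a separable space
```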
Advantages of Kernel Support Vector Machine:
 The kernel trick is the real strength of SVM. With an appropriate kernel function, complex
problems can be solved.
 It scales relatively well to high dimensional data.
 Risk of overfitting is less.
Disadvantages of Kernel Support Vector Machine:
 Choosing a good kernel function is not easy.
 Longer training time for large datasets.
 Difficult to understand and interpret the final model.
 Highly compute intensive.
Decision Trees
A decision tree is a hierarchical model for supervised learning whereby the local
region is identified in a sequence of recursive splits in a smaller number of steps.
Classification Trees
Impurity measures in classification trees:
Entropy: a common impurity measure, given by Entropy = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the
proportion of instances of class i at the node.
Gini index: Gini = 1 − Σⱼ Pⱼ²
Here j represents the number of classes in the label and Pⱼ is the proportion of class j at the
node. The Gini index computes the probability that a specific instance is wrongly
classified when it is labelled at random according to the class distribution. The Gini index varies
from 0 to 1:
 0 indicates that all the elements belong to one class, i.e., only one
class exists at the node.
 A value of 1 signifies that the elements are randomly
distributed across many classes.
 A value of 0.5 denotes that the elements are equally distributed between two
classes.
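A small sketch of both impurity measures, taking the class proportions at a node as input:

```python
# Entropy and Gini impurity for one node of a classification tree.
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # impure node: 1.0, 0.5
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # pure node:   0.0, 0.0
```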
Regression Tree
A regression tree is constructed similarly to a classification tree, except that the impurity
measure appropriate for classification is replaced by a measure appropriate for
regression. For node m, Xₘ is the subset of X reaching node m: the
set of all x ∈ X satisfying all the conditions in the decision nodes on the path from the root
to node m.
Pruning:
 In a decision tree, a node is not split further if the number of training instances
reaching it is smaller than a certain percentage of the training set.
 Any decision based on too few instances causes variance and induces
generalization error.
 Stopping tree construction early, before the tree is full, is called
prepruning.
 In postpruning, the tree is first grown in full, without early stopping, until all leaves
are pure and there is no training error; then the subtrees that cause overfitting are
found and pruned.
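One practical way to realize postpruning, sketched here with scikit-learn's cost-complexity pruning; the ccp_alpha value is illustrative.

```python
# Postpruning via cost-complexity pruning: larger ccp_alpha prunes more.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)       # grown to purity
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())             # pruned has fewer
```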
Rule Extraction from Trees
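Each path from the root of a learned tree to a leaf can be read off as an IF-THEN rule: the conjunction of the decision-node conditions along the path forms the antecedent, and the class (or value) stored at the leaf forms the consequent. The resulting rule set makes the knowledge in the tree easy to inspect and explain.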
Random Forest
Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and
combines them to improve the predictive accuracy on that dataset.
The random forest algorithm establishes the outcome based on the predictions of the decision trees: it
predicts by taking the majority vote or the mean of the outputs from the various trees. Increasing the number of trees
increases the precision of the outcome. Random forest is built on the bagging technique; boosting is a
contrasting ensemble approach.
 Bagging: creates different training subsets from the sample training data with replacement;
the final output is based on majority voting.
 Boosting: combines weak learners into strong learners by creating sequential models such
that the final model has the highest accuracy.
Steps to construct a Random Forest:
1. Select random samples from the training set, with replacement (bootstrap samples).
2. Build a decision tree for each sample, considering a random subset of features at each split.
3. Aggregate the predictions of all trees: majority voting for classification, averaging for regression.
Features of Random Forest
 Diversity: not all attributes/variables/features are considered while
making an individual tree; each tree is different.
 Immune to the curse of dimensionality
 Each tree is created independently out of different data and attributes.
This means that we can make full use of the CPU to build random
forests.
 Random forest does not demand the data to be segregated into train and
test sets: roughly a third of the data (the out-of-bag samples) is never
seen by any individual tree.
Stability of results arises because the result is based on majority voting/averaging.
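A minimal sketch (assuming scikit-learn is available): a random forest whose out-of-bag samples, the roughly one-third of rows each tree never sees, provide a built-in validation score.

```python
# Random forest with out-of-bag (OOB) evaluation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0).fit(X, y)
print(forest.oob_score_)   # accuracy estimated on the unseen OOB samples
```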
Differences between decision trees and Random Forest:
Decision trees suffer from overfitting if grown without any control; as the output of a
Random Forest is based on averaging or majority ranking, overfitting is controlled.
A single decision tree is faster in computation; a Random Forest is comparatively slower,
as aggregating the results of the individual trees may consume time.
When a dataset with features is taken as input, a decision tree formulates a set of rules to
predict the outcome; a Random Forest randomly selects observations, builds decision trees
and takes the average result, and is not rule based.
Advantages of random forest
 It can be used in classification as well as regression problems.
 It solves the problem of overfitting as output is based on majority voting or
averaging.
 It performs well even if the data contains null/missing values.
 Each decision tree created is independent of the other thus it shows the property of
parallelization.
 It is highly stable as the average answers given by a large number of trees are taken.
 It maintains diversity, as not all attributes are considered while making each
decision tree (though this is not true in all cases).
 It is immune to the curse of dimensionality.
 No separate train/test split is strictly needed, as roughly a third of the data
(out-of-bag) is never seen by each decision tree.
Disadvantages
 It is highly complex compared to decision trees, where decisions can be made
by following the path of the tree.
 Training time is longer than for other models due to this complexity:
whenever a prediction is to be made, every decision tree must generate an
output for the given input.
 This requires more resources.