UNIT III
SUPERVISED LEARNING
Presented by
Jeyalakshmi.P.,M.E(Ph.D)
AP/ECE
NPR College of Engineering and Technology
Natham
MACHINE LEARNING
“The field of study that gives computers the ability to learn without being explicitly programmed.”
Relation between AI and ML
AI focuses on non-biological systems that exhibit human-like behaviour.
ML algorithms learn models, trends, patterns and rules automatically from the underlying data
representations.
DL is an emerging technology in which algorithms parametrize multilayer neural networks to learn
the data representation.
1. Traditional Programming vs Machine Learning
a) Traditional Programming b) Machine Learning
Types of Machine Learning
Supervised algorithms
 These are also known as predictive algorithms, and they are used to classify or predict
the outcomes of new input with prior knowledge acquired from previous inputs.
 The learning process in supervised algorithms is guided by the inputs and their
corresponding outputs.
 The mapping between the inputs and their outputs is the key for the algorithm to learn.
 The final ML model is developed after iterative learning of the training examples.
 The supervised algorithms can be further classified into:
Regression algorithms: The output variable of these algorithms is a
continuous or real value. For instance, the prediction of the next day's
temperature is done using a regression algorithm.
Classification algorithms: The output variable of these algorithms is a
class or category. For instance, the prediction of the next day's overall
weather is done using a classification algorithm, since the outcome is
one among {sunny, overcast, rainy, cloudy}.
Examples: linear regression, Support Vector Machines (SVM), regression
trees, logistic regression, etc.
Unsupervised algorithms
 Learning in unsupervised algorithms happens without labelled data; that is, the training examples are
presented to the algorithm without targets or outputs.
The algorithm learns the underlying patterns and similarities in the input data to discover hidden
knowledge. This process is referred to as knowledge discovery.
The unsupervised algorithms are further categorized as:
Clustering algorithms: The knowledge discovery in these types of algorithms happens by
uncovering the inherent similarities among the training data.
Association: algorithms that extract rules which can describe large classes
of data.
Examples: K-means clustering, fuzzy c-means clustering, Apriori association rules, etc.
Semi-Supervised algorithms
The learning process in semi-supervised algorithms happens partially on labelled data.
The model is trained on a relatively small amount of labelled data and a larger quantity of
unlabelled data.
The cost incurred to procure labelled data is the motivation for the genesis of semi-supervised
algorithms.
The working of semi-supervised algorithms can be realized in two phases: typically, a model is first
trained on the labelled data; it is then used to assign pseudo-labels to the unlabelled data and
retrained on the combined set.
Reinforcement learning algorithms
 The learning in reinforcement learning takes place by making software agents learn an ideal
behaviour in the given learning environment so as to attain maximum performance.
 The agents are iteratively rewarded through a reinforcement feedback signal. This signal is the central
factor guiding the agent to adapt to or learn the environment and decide on the next step.
 The outcome of learning in these algorithms is an optimal policy that maximises the performance of
the agent.
 Reinforcement learning is commonly used in the field of robotics.
 Examples: adversarial networks and Q-learning.
LINEAR REGRESSION MODELS
Linear regression shows a linear relationship between a dependent variable (y) and one or more independent
variables (x).
Linear Regression is a supervised, statistical ML method used for predictive analysis. It uses the relationship
between the data points to draw a straight line through them.
The mathematical equation of simple linear regression is Y = a + bX, where a is the intercept and b is the slope.
A least-squares regression method is a form of statistical regression analysis that establishes the
relationship between the dependent variable (Y) and the independent variable (X) through a straight line,
referred to as the line of best fit.
Least Squares Regression
 Least squares regression is used to predict the behavior of dependent variables.
 This method of regression analysis begins with a set of data points to be plotted on an x- and y-axis graph.
 If the data shows a linear relationship between two variables, the line that best fits this relationship is
known as the least-squares regression line; it minimizes the vertical distance from the data points to the
regression line.
 The term “least squares” refers to the smallest sum of squared errors, also known as the residual variance.
 The least-squares method is often applied in data fitting.
 The best-fit result minimizes the sum of squared errors or residuals, i.e., the differences between the
observed or experimental values and the corresponding fitted values given by the model.
 There are two basic categories of least-squares problems:
 Ordinary or linear least squares: used in statistical regression analysis
 Nonlinear least squares: an iterative method that approximates the model by a linear one at
each iteration.
 The fit is obtained by minimizing the residuals, or offsets, of each data point
from the line.
Advantages
 The least-squares method of regression analysis is best suited for prediction models and trend analysis.
 It is best used in the fields of economics, finance, and stock markets wherein the value of any future
variable is predicted with the help of existing variables and the relationship between the same.
 The least-squares method provides the closest relationship between the variables.
 The sum of squared residuals from the line of best fit is minimal under this
method.
 The computation mechanism is simple and easy to apply.
Disadvantages
 This method relies on establishing the closest relationship between a given set of variables.
 The computation mechanism is sensitive to the data; in the presence of outliers, the results may
be severely affected.
 More exhaustive computation mechanisms are required for nonlinear problems.
Least Squares Algorithm
• For each (x, y) point calculate x² and xy
• Sum all x, y, x² and xy, which gives Σx, Σy, Σx² and Σxy
• Calculate slope b:
b = (Σxy − (Σx·Σy)/n) / (Σx² − (Σx)²/n)
where n is the number of points.
• Calculate intercept a:
a = (Σy − b·Σx)/n
• Assemble the equation of the line: Y = bx + a
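A minimal Python sketch of this procedure follows; the function name and structure are illustrative, not part of the original slides, and the data reuses the rainfall example worked out below.

```python
# Least-squares fit of a line Y = b*x + a, following the steps above.

def least_squares_fit(points):
    """Return slope b and intercept a of the least-squares line."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)     # Sigma x^2
    sum_xy = sum(x * y for x, y in points)     # Sigma xy
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x * sum_x / n)
    a = (sum_y - b * sum_x) / n
    return b, a

# Rainfall example from the next slide: (hours of rain, fries sold)
data = [(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)]
b, a = least_squares_fit(data)
print(b, a)          # ~1.5182, ~0.3049
print(b * 8 + a)     # prediction for 8 hours of rain, ~12.45
```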
The table below gives the number of hours of rainfall in Chennai and the number of French fries sold
in a canteen during a week, from Monday to Friday. Predict the number of French fries to be
prepared on Saturday if 8 hours of rainfall is expected.
Hours of Rain (x) No. of French Fries sold (y)
2 4
3 5
5 7
7 10
9 15
x y x² xy
2 4 4 8
3 5 9 15
5 7 25 35
7 10 49 70
9 15 81 135
Σx = 26 Σy = 41 Σx² = 168 Σxy = 263
Solution:
• For each (x, y) point calculate x² and xy.
• Find Σx, Σy, Σx² and Σxy.
1. Find the slope (b):
b = (263 − (26×41)/5) / (168 − (26×26)/5) = 1.5182
2. Calculate the intercept (a):
a = (41 − (1.5182×26))/5 = 0.3049
3. Form the equation: Y = 1.5182x + 0.3049
Computing the Error
x y Y= 1.5182x+0.3049 Error (Y-y)
2 4 3.3413 -0.6587
3 5 4.8595 -0.1405
5 7 7.8959 0.8959
7 10 10.9323 0.9323
9 15 13.9687 -1.0313
Visualizing the Line of fit:
Number of French fries to be prepared if it rains for 8 hours:
Substitute x = 8 in Y = 1.5182x + 0.3049, giving Y = 12.45.
So, approximately 13 French fries should be prepared on Saturday.
Single and Multiple Variables
 Simple or single linear regression performs regression analysis of two variables. The
single independent variable impacts the slope of the regression line.
 Multiple regression is a broader class of regressions that encompasses linear and
nonlinear regressions with multiple explanatory variables.
 Each independent variable in multiple regression has its own coefficient to ensure each
variable is weighted appropriately to establish complex connections between variables.
 Two main operations are done in multiple variable regression:
i) Determine the dependent variable based on multiple independent variables.
ii) Determine the strength of the relationship between the dependent variable and each independent variable.
 Multiple regression assumes there is no strong relationship among the independent
variables themselves.
 It also assumes there is a correlation between each independent variable and the single
dependent variable.
 Each of these relationships is weighted to ensure more impactful independent variables drive
the dependent value by adding a unique regression coefficient to each independent variable.
 Using multiple variables for regression gives a more specific calculation than simple linear
regression. More complex relationships can be captured through multiple linear regression.
 All the variables together use multiple slopes to predict the outcome of a single target variable:
Y = a + b₁x₁ + b₂x₂ + … + bₙxₙ
 In the above equation, b₁, b₂, …, bₙ are the slopes for the individual variables x₁, x₂, …, xₙ.
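As a hedged illustration, the same idea can be realized with NumPy's least-squares solver over several explanatory variables; the feature values below are invented for demonstration.

```python
# Multiple linear regression Y = a + b1*x1 + b2*x2 via least squares.
import numpy as np

# Each row: [x1, x2] (e.g., temperature, humidity); y is the target.
X = np.array([[20.0, 60.0], [25.0, 55.0], [30.0, 40.0], [22.0, 70.0]])
y = np.array([3.0, 4.2, 5.1, 3.4])

# Prepend a column of ones so the intercept a is learned as a coefficient.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve the least-squares problem for w = [a, b1, b2].
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
a, b1, b2 = w
print(a, b1, b2)
print(a + b1 * 24 + b2 * 50)   # prediction for a new point
```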
Difference between single and multiple variable regression:
Single variable regression: one dependent variable Y is predicted from a single
explanatory variable x; there is a single regression coefficient (slope).
Example: predicting BMI from age.
Multiple variable regression: one dependent variable Y is predicted from multiple
explanatory variables; there are multiple regression coefficients.
Example: predicting BMI from age, height, gender, etc.
Simple Linear Regression: Only
temperature is used to predict the
weather
Multivariate linear regression:
Three variables are used to predict the weather
Bayesian Regression
Unlike least-squares regression, Bayesian regression uses probability
distributions rather than point estimates. This method is useful when the available
data is scarce.
Bayesian linear regression pushes the idea of the parameter prior a step
further: it does not even attempt to compute a point estimate of the
parameters; instead, the full posterior distribution over the parameters is
taken into account when making predictions.
This allows priors to be placed on the coefficients and on the noise, so
that in the absence of data the priors can take over the regression process.
Also, Bayesian linear regression can tell which parts of its fit the data support with high
confidence and where it is very uncertain.
In Bayesian regression, the response y is not estimated as a single value but
is assumed to be drawn from a probability distribution. The goal of Bayesian
linear regression is to find the posterior over the parameters rather than a single
set of model parameters:
Posterior = (Likelihood × Prior) / Normalization
Likelihood describes the probability of the target values given the data and
parameters.
Prior describes the initial knowledge about which parameter values are
likely and unlikely.
Evidence or normalisation describes the marginal probability of the data and targets; it normalizes the posterior.
Steps to perform Bayesian Regression:
Bayesian regression gives a probabilistic outlook to simple linear regression. It assumes that the data
points follow a normal distribution with zero mean and some known variance (σ²).
Bayesian linear regression estimates distributions over the parameters and predictions. This allows the
uncertainty in the predictions to be modelled.
 Set up a probabilistic model that describes the assumptions about how the data and parameters are
generated.
 Compute inference for the parameters: compute the posterior probability distribution over the parameters.
 With this posterior, perform inference for new, unseen inputs. This step does not
compute point estimates of the outputs; instead, it computes the parameters of the
posterior distribution over the outputs.
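As an illustrative sketch (assuming scikit-learn is available), BayesianRidge implements this style of Bayesian linear regression and can report predictive uncertainty alongside the mean; the toy data below is invented.

```python
# Bayesian linear regression with priors over coefficients and noise.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 1.5 * X.ravel() + 0.3 + rng.normal(0, 1.0, size=30)

model = BayesianRidge()
model.fit(X, y)

# return_std=True yields the predictive standard deviation, i.e. the
# model's uncertainty for each input, rather than a bare point estimate.
mean, std = model.predict([[8.0]], return_std=True)
print(mean, std)
```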
Advantages:
 This method can retrieve the complete range of inferential solutions instead of a
point estimate.
 It works efficiently with small datasets.
 It is very suitable for online learning, where data arrive one at a time, as opposed
to batch learning, where the whole dataset is available up front.
 It is a very powerful and tested approach.
Disadvantages:
 It does not work efficiently if the dataset contains a huge amount of data.
 Inference on the model can be time-consuming.
Gradient Descent
Gradient Descent is an optimization algorithm used to train machine learning models by minimizing the
error between the actual and predicted results.
Steps to perform Gradient Descent:
 Initialize the coefficients of the function. These could be 0 or small random values.
So, set coefficients = 0.
 Evaluate the cost by plugging the coefficients into the function, i.e., cost =
f(coefficients).
 Calculate the derivative of the cost function, to know the slope and hence the direction (sign)
in which to move the coefficient values so as to get a lower cost on the next iteration. So,
calculate change = derivative(cost).
 The downhill direction is now known from the derivative; update the coefficient values
accordingly. Specify a learning rate that controls how much the coefficients change on each
update. So, coefficient = coefficient − (learning_rate × change).
 Repeat this process until the cost is 0 or close to 0.
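A runnable sketch of these steps for a one-variable linear model, reusing the rainfall data from the least-squares example; the learning rate and iteration count are illustrative.

```python
# Gradient descent for y ~ b*x + a, minimizing mean squared error.
data = [(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)]
a, b = 0.0, 0.0                 # initialize coefficients to 0
lr = 0.01                       # learning rate
for _ in range(20000):
    # cost = mean((b*x + a - y)^2); compute its derivatives w.r.t. a and b
    grad_a = sum(2 * (b * x + a - y) for x, y in data) / len(data)
    grad_b = sum(2 * (b * x + a - y) * x for x, y in data) / len(data)
    a -= lr * grad_a            # move downhill
    b -= lr * grad_b
print(b, a)                     # approaches the least-squares solution
```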
Types of Gradient Descent
1. Batch gradient descent
2. Stochastic gradient descent (SGD)
3. Mini-batch gradient descent
CLASSIFICATION MODELS
The classification model is a Supervised Learning technique that identifies the
category of new observations on the basis of training data. Classification
predictive modeling involves assigning a class label to input examples.
Types of classification:
Binary classification refers to predicting one of two classes; multi-class
classification involves predicting one of more than two classes.
Example: email spam detection (spam or not).
Multi-label classification involves predicting one or more classes for each example;
imbalanced classification refers to classification tasks where the distribution of
examples across the classes is not equal.
Example: optical character recognition.
Discriminant Functions
A linear discriminant function is a linear combination of the components of x and can be
written as g(x) = wᵀx + w₀, where w is the weight vector and w₀ is the bias or threshold
weight.
The problem of finding a linear discriminant function can be formulated as a
problem of minimizing a criterion function.
Types of discriminant functions:
1. Two-category case
2. Multi-category case
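A small illustrative sketch of the two-category case: classify by the sign of g(x) = wᵀx + w₀. The weight values below are invented for demonstration.

```python
# Two-category linear discriminant: decide class 1 if g(x) > 0, else class 2.
import numpy as np

w = np.array([1.0, -2.0])   # weight vector (illustrative)
w0 = 0.5                    # bias / threshold weight

def g(x):
    return w @ x + w0       # g(x) = w^T x + w0

x = np.array([3.0, 1.0])
label = 1 if g(x) > 0 else 2
print(g(x), label)          # 1.5 -> class 1
```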
PROBABILISTIC DISCRIMINANT FUNCTIONS
Probabilistic Linear Discriminant Analysis (PLDA) is a probabilistic version of linear discriminant
functions with the ability to handle more complex data.
It is widely used for recognition, similarity checking, feature extraction and verification
processes.
Advantages of PLDA
 Generate a class center using continuous non-linear functions, even from a single
example of an unseen class.
 Compare two examples from previously unseen class(es) to determine whether
they belong to the same class.
 Perform clustering of samples from unseen classes.
Find the class for the regression equation derived in Example 1 for x = 50. (Regression equation
from Example 1: Y = 1.5182x + 0.3049.)
Y = 1.5182 × 50 + 0.3049 = 76.2149
S = e^Y = 1.258
P = S/(1 + S) = 0.386
Using the threshold value of 0.5, the given example is assigned to class 0, since the
predicted probability of class 1 is only 38%.
PROBABILISTIC GENERATIVE MODEL
Generative models or latent variable models or causal models are a way of modeling how a
set of observed data could have arisen from a set of underlying causes.
Naïve Bayes
The Naive Bayes classification algorithm is a probabilistic classifier based on probability models that
incorporate strong independence assumptions. These independence assumptions rarely hold in
reality, hence the classifier is considered naive.
Multinomial Naive Bayes: the features/predictors used by the classifier label the
instances in discrete categories.
Bernoulli Naive Bayes: the predictors are boolean variables, i.e., parameters that take
only yes or no values.
Gaussian Naive Bayes: when the predictors take continuous rather than discrete values,
they are assumed to follow a Gaussian distribution.
Advantages of Naïve Bayes Classifier:
 A fast and easy ML algorithm for predicting the class of a dataset.
 It can be used for binary as well as multi-class classification.
 It performs well in multi-class predictions.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Example 3
Given the training data in the table below, classify the following example using Naïve
Bayes: age<=30, income=medium, student=yes, credit_rating=fair.
RID Age Income Student Credit_rating Class: buys_computer
1 <=30 High No Fair No
2 <=30 High No Excellent No
3 31 to 40 High No Fair Yes
4 >40 Medium No Fair Yes
5 >40 Low Yes Fair Yes
6 >40 Low Yes Excellent No
7 31 to 40 Low Yes Excellent Yes
8 <=30 Medium No Fair No
9 <=30 Low Yes Fair Yes
10 >40 Medium Yes Fair Yes
11 <=30 Medium Yes Excellent Yes
12 31 to 40 Medium No Excellent Yes
13 31 to 40 High Yes Fair Yes
14 >40 Medium No Excellent No
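A worked sketch of the computation follows; the class counts come straight from the table above, while the Python encoding itself is illustrative.

```python
# Naive Bayes by hand for the query (age<=30, income=medium,
# student=yes, credit_rating=fair), using the 14 training rows above.
rows = [
    ("<=30","High","No","Fair","No"), ("<=30","High","No","Excellent","No"),
    ("31-40","High","No","Fair","Yes"), (">40","Medium","No","Fair","Yes"),
    (">40","Low","Yes","Fair","Yes"), (">40","Low","Yes","Excellent","No"),
    ("31-40","Low","Yes","Excellent","Yes"), ("<=30","Medium","No","Fair","No"),
    ("<=30","Low","Yes","Fair","Yes"), (">40","Medium","Yes","Fair","Yes"),
    ("<=30","Medium","Yes","Excellent","Yes"), ("31-40","Medium","No","Excellent","Yes"),
    ("31-40","High","Yes","Fair","Yes"), (">40","Medium","No","Excellent","No"),
]
query = ("<=30", "Medium", "Yes", "Fair")

for c in ("Yes", "No"):
    subset = [r for r in rows if r[4] == c]
    p = len(subset) / len(rows)                  # prior P(class = c)
    for i, v in enumerate(query):                # times P(feature_i = v | c)
        p *= sum(1 for r in subset if r[i] == v) / len(subset)
    print(c, round(p, 4))
# Yes: 9/14 * 2/9 * 4/9 * 6/9 * 6/9 ~ 0.0282
# No:  5/14 * 3/5 * 2/5 * 1/5 * 2/5 ~ 0.0069  -> predict buys_computer = Yes
```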
MAXIMUM MARGIN CLASSIFIER
The maximal margin classifier is the optimal separating hyperplane defined on two linearly separable classes. Given an n×p
data matrix X with a binary response variable y ∈ {−1, 1}, it may be possible to define a (p−1)-dimensional
hyperplane that separates the two classes.
Support Vector Machines
The Support Vector Classifier is an extension of the Maximal Margin Classifier that is less
sensitive to individual data points. It allows a few data points to be misclassified, and is hence
called a Soft Margin Classifier.
The main goal of SVM is to find a hyperplane in an N-dimensional space, where N is the
number of features, that distinctly classifies the data points.
 The SVM classifier constructs a hyperplane in an N-dimensional space that divides the
data points belonging to different classes.
 This hyperplane is chosen based on margin: the hyperplane providing the
maximum margin between the two classes is selected. These margins are calculated
using particular data points, called Support Vectors.
 Support vectors lie nearest to the hyperplane and help in orienting it.
 Multiple hyperplanes exist that separate the two classes of data points.
 The objective is to find the plane with maximum margin; in other words, the distance
between the data points of both classes and the hyperplane should be maximum.
 Data points falling on either side of the hyperplane can be attributed to different classes.
 The dimension of the hyperplane depends upon the number of features. If the number of
input features is 2, then the hyperplane is just a line. If the number of input features is 3,
then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine
when the number of features exceeds 3.
 Support vectors are the data points closest to the hyperplane; they determine its
position and orientation.
Using these support vectors, the margin of the classifier is maximized.
Steps in the SVM classifier:
1. The SVM algorithm predicts the classes. One of the classes is identified as 1 (the positive class) while
the other is identified as −1 (the negative class).
2. The hinge loss function is used to find the maximum margin (θ is the parameter vector):
loss = max(0, 1 − y · θᵀx)
3. When all classes are correctly predicted, the cost function is 0. The problem with
SVM is that there is a trade-off between maximizing the margin and the loss generated if the
margin is maximized to a very large extent. A regularization parameter is therefore also added to
the loss function.
4. Weights are optimized by calculating the gradients. The gradients are updated using only
the regularization parameter when there is no classification error, while the
loss term is also used when misclassification happens.
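An illustrative sketch of these steps: a linear SVM trained by subgradient descent on the regularized hinge loss. The data, learning rate and regularization strength are invented for demonstration.

```python
# Linear SVM via subgradient descent on hinge loss + L2 regularization.
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 1.5], [6.0, 5.0], [7.0, 8.0]])
y = np.array([-1, -1, 1, 1])          # classes encoded as -1 / +1

w = np.zeros(2); b = 0.0
lr, lam = 0.01, 0.01                  # learning rate, regularization strength
for _ in range(1000):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) >= 1:
            w -= lr * (2 * lam * w)             # correct side: regularize only
        else:
            w -= lr * (2 * lam * w - yi * xi)   # inside margin: hinge term too
            b += lr * yi
print(w, b, np.sign(X @ w + b))       # should recover [-1, -1, 1, 1]
```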
Advantages of Support Vector Machine:
 The SVM can ignore outliers and find the hyper-plane that has the maximum margin.
 SVM works well when there is a clear margin of separation between classes.
 It is more effective in high dimensional spaces.
 SVM is memory efficient.
Disadvantages of Support Vector Machine:
 SVM algorithm is not suitable for large data sets.
 SVM does not perform very well when the data set has more noise i.e. target classes are overlapping.
 As the support vector classifier works by placing data points above and below the classifying
hyperplane, there is no direct probabilistic explanation for the classification.
Kernel based SVM
Classifying the data points using hyperplanes is not always possible, as the data points may not be
linearly separable in 2D, or no hyperplane may exist to separate them in 3D. In these cases, kernel-based
SVM is used.
Using kernels:
 Map the lower-dimensional data points, via a mapping function, to a higher dimension where
they become separable.
 Fit a line or hyperplane, as required, to separate those points.
 Project the resulting boundary back to the lower dimensions.
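A hedged illustration (assuming scikit-learn is available): an RBF-kernel SVC separates two concentric rings that no straight line can; the dataset parameters are illustrative.

```python
# Kernel SVM on data that is not linearly separable in 2D.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)
print(linear.score(X, y))   # poor: the classes are not linearly separable
print(rbf.score(X, y))      # near 1.0: the kernel maps to a separable space
```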
Advantages of Kernel Support Vector Machine:
 The kernel trick is the real strength of SVM. With an appropriate kernel function, complex
problems can be solved.
 It scales relatively well to high dimensional data.
 Risk of overfitting is less.
Disadvantages of Kernel Support Vector Machine:
 Choosing a good kernel function is not easy.
 Longer training time for large datasets.
 Difficult to understand and interpret the final model.
 Highly compute intensive.
Decision Trees
A decision tree is a hierarchical model for supervised learning whereby the local
region is identified in a sequence of recursive splits in a smaller number of steps.
Classification Trees
Impurity measures in classification trees:
Entropy: a common impurity measure, given by Entropy = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the
proportion of instances of class i at the node.
Gini index: Gini = 1 − Σⱼ Pⱼ²
Here j represents the number of classes in the label and Pⱼ is the proportion of class j at the
node. The Gini index computes the probability that a specific instance is wrongly
classified when it is labelled at random according to the class distribution. The Gini index varies
from 0 to 1:
 0 indicates that all the elements belong to one class, i.e., only one
class exists at the node.
 A value of 1 signifies that the elements are randomly
distributed across many classes.
 A value of 0.5 denotes that the elements are equally distributed between two
classes.
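A small sketch of both impurity measures, taking the class proportions at a node as input:

```python
# Entropy and Gini impurity for one node of a classification tree.
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # impure node: 1.0, 0.5
print(entropy([1.0, 0.0]), gini([1.0, 0.0]))   # pure node:   0.0, 0.0
```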
Regression Tree
A regression tree is constructed similarly to a classification tree, except that the impurity
measure appropriate for classification is replaced by a measure appropriate for
regression. For node m, Xₘ is the subset of X reaching node m: the
set of all x ∈ X satisfying all the conditions in the decision nodes on the path from the root
to node m.
Pruning:
 In a decision tree, a node is not split further if the number of training instances
reaching it is smaller than a certain percentage of the training set.
 Any decision based on too few instances causes variance and induces
generalization error.
 Stopping tree construction early, before the tree is full, is called
prepruning.
 In postpruning, the tree is first grown in full, without early stopping, until all leaves
are pure and there is no training error; then the subtrees that cause overfitting are
found and pruned.
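One practical way to realize postpruning, sketched here with scikit-learn's cost-complexity pruning; the ccp_alpha value is illustrative.

```python
# Postpruning via cost-complexity pruning: larger ccp_alpha prunes more.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full = DecisionTreeClassifier(random_state=0).fit(X, y)       # grown to purity
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(full.get_n_leaves(), pruned.get_n_leaves())             # pruned has fewer
```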
Rule Extraction from Trees
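Each path from the root of a learned tree to a leaf can be read off as an IF-THEN rule: the conjunction of the decision-node conditions along the path forms the antecedent, and the class (or value) stored at the leaf forms the consequent. The resulting rule set makes the knowledge in the tree easy to inspect and explain.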
Random Forest
Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and
combines them to improve the predictive accuracy on that dataset.
The random forest algorithm establishes the outcome based on the predictions of the decision trees: it
predicts by taking the majority vote or the mean of the outputs from the various trees. Increasing the number of trees
increases the precision of the outcome. Random forest is built on the bagging technique; boosting is a
contrasting ensemble approach.
 Bagging: creates different training subsets from the sample training data with replacement;
the final output is based on majority voting.
 Boosting: combines weak learners into strong learners by creating sequential models such
that the final model has the highest accuracy.
Steps to construct a Random Forest:
1. Select random samples from the training set, with replacement (bootstrap samples).
2. Build a decision tree for each sample, considering a random subset of features at each split.
3. Aggregate the predictions of all trees: majority voting for classification, averaging for regression.
Features of Random Forest
 Diversity: not all attributes/variables/features are considered while
making an individual tree; each tree is different.
 Immune to the curse of dimensionality
 Each tree is created independently out of different data and attributes.
This means that we can make full use of the CPU to build random
forests.
 Random forest does not demand the data to be segregated into train and
test sets: roughly a third of the data (the out-of-bag samples) is never
seen by any individual tree.
Stability of results arises because the result is based on majority voting/averaging.
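A minimal sketch (assuming scikit-learn is available): a random forest whose out-of-bag samples, the roughly one-third of rows each tree never sees, provide a built-in validation score.

```python
# Random forest with out-of-bag (OOB) evaluation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0).fit(X, y)
print(forest.oob_score_)   # accuracy estimated on the unseen OOB samples
```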
Differences between decision trees and Random Forest:
Decision trees suffer from overfitting if grown without any control; as the output of a
Random Forest is based on averaging or majority ranking, overfitting is controlled.
A single decision tree is faster in computation; a Random Forest is comparatively slower,
as aggregating the results of the individual trees may consume time.
When a dataset with features is taken as input, a decision tree formulates a set of rules to
predict the outcome; a Random Forest randomly selects observations, builds decision trees
and takes the average result, and is not rule based.
Advantages of random forest
 It can be used in classification as well as regression problems.
 It solves the problem of overfitting as output is based on majority voting or
averaging.
 It performs well even if the data contains null/missing values.
 Each decision tree created is independent of the other thus it shows the property of
parallelization.
 It is highly stable as the average answers given by a large number of trees are taken.
 It maintains diversity, as not all attributes are considered while making each
decision tree (though this is not true in all cases).
 It is immune to the curse of dimensionality.
 No separate train/test split is strictly needed, as roughly a third of the data
(out-of-bag) is never seen by each decision tree.
Disadvantages
 It is highly complex compared to decision trees, where decisions can be made
by following the path of the tree.
 Training time is longer than for other models due to this complexity:
whenever a prediction is to be made, every decision tree must generate an
output for the given input.
 This requires more resources.