MODULE-2
• Understanding Data – 2: Bivariate Data and Multivariate Data,
Multivariate Statistics, Essential Mathematics for Multivariate Data,
Feature Engineering and Dimensionality Reduction Techniques.
• Basic Learning Theory: Design of Learning System, Introduction to
Concept of Learning, Modelling in Machine Learning.
BIVARIATE DATA AND MULTIVARIATE DATA

Bivariate Data involves two variables. Bivariate data deals with causes of
relationships. The aim is to find relationships among data. Consider the following
Table 2.3, with data of the temperature in a shop and sales of sweaters.

Here, the aim of bivariate analysis is to find relationships among variables. The
relationships can then be used in comparisons, finding causes, and in further
explorations. To do that, graphical display of the data is necessary. One such
graph method is called scatter plot.

Scatter plot is used to visualize bivariate data. It is useful to plot two variables
with or without nominal variables, to illustrate the trends, and also to show
differences. It is a plot between the explanatory and response variables. It is a 2D
graph showing the relationship between two variables.

The scatter plot (Refer Figure 2.11) indicates strength, shape, direction and the
presence of outliers. It is useful in exploratory data analysis before calculating a
correlation coefficient or fitting a regression curve.
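As a quick illustration (a minimal sketch; the temperature and sales values below are made up for illustration and are not the values from Table 2.3), a scatter plot of bivariate data can be drawn with matplotlib:

```python
import matplotlib.pyplot as plt

# Hypothetical temperature vs. sweater-sales readings (illustrative values only)
temperature = [5, 10, 15, 20, 25, 30]   # explanatory variable (x)
sales = [300, 250, 200, 150, 100, 60]   # response variable (y)

plt.scatter(temperature, sales)
plt.xlabel("Temperature")
plt.ylabel("Sweater sales")
plt.title("Scatter plot of bivariate data")
plt.show()
```

Here the downward trend of the points indicates a negative relationship between temperature and sales.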
2.6.1 Bivariate Statistics
Covariance and correlation are examples of bivariate statistics. Covariance is a
measure of the joint variability of two random variables. Generally, random variables are
represented in capital letters. It is denoted as covariance(X, Y) or COV(X, Y) and
is used to measure the variance between two dimensions. The formula for finding the
covariance for specific xi and yi is:
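The slide shows the formula as an image; the standard population form of covariance, with divisor N, is:

$$\mathrm{COV}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i - E(X)\bigr)\bigl(y_i - E(Y)\bigr)$$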
Here, xi and yi are data values from X and Y, E(X) and E(Y) are the mean values
of X and Y, and N is the number of given data values. Also, COV(X, Y) is the same as
COV(Y, X).
The covariance between X and Y is 12. It can be normalized to a value between −1 and +1. This
is done by dividing it by the product of the standard deviations of X and Y; the normalized value is called the Pearson correlation coefficient.
Sometimes, N − 1 can be used instead of N; in that case, the covariance is 60/4 = 15.
Correlation
The Pearson correlation coefficient is the most common test for determining any association
between two phenomena. It measures the strength and direction of a linear relationship between
the x and y variables.
The correlation indicates the relationship between dimensions using its sign. The sign is more
important than the actual value.
1. If the value is positive, it indicates that the dimensions increase together.

2. If the value is negative, it indicates that while one dimension increases, the
other dimension decreases.

3. If the value is zero, then it indicates that both the dimensions are independent
of each other.

If the dimensions are correlated, then it is better to remove one dimension as it is
a redundant dimension.

If the given attributes are X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), then the Pearson
correlation coefficient, denoted as r, is given as:
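The symbol and formula appear as an image on the slide; the standard Pearson correlation coefficient, denoted r, is:

$$r = \frac{\mathrm{COV}(X, Y)}{\sigma_X\,\sigma_Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$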
2.7 MULTIVARIATE STATISTICS
In machine learning, almost all datasets are multivariate. Multivariate analysis deals
with more than two observable variables, and often, thousands of
measurements need to be conducted for one or more subjects.
Multivariate data is like bivariate data but may have more than two
dependent variables. Some of the multivariate analysis methods are regression analysis,
principal component analysis, and path analysis.
• The mean of multivariate data is a mean vector, and the mean of the above three attributes
is given as (2, 7.5, 1.33). The variance of multivariate data becomes the covariance matrix.
• The mean vector is called the centroid and the variance is called the dispersion matrix. This is
discussed in the next section.
• Multivariate data has three or more variables. The aims of multivariate analysis are much
broader; they include regression analysis, factor analysis and multivariate analysis of variance,
which are explained in the subsequent chapters of this book.
Heatmap

Heatmap is a graphical representation of a 2D matrix. It takes a matrix as input and
colours it. The darker colours indicate very large values and lighter colours
indicate smaller values.

The advantage of this method is that humans perceive colours well. So, by colour
shading, larger values can be perceived easily.

For example, in vehicle traffic data, heavy traffic regions can be differentiated
from low traffic regions through a heatmap.

In Figure 2.13, patient data highlighting weight and health status is plotted. Here,
the X-axis is weights and the Y-axis is patient counts. The dark colour regions highlight
patients’ weights vs patient counts by health status.
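A minimal heatmap sketch, assuming seaborn is available; the matrix values below are random placeholders, not the patient data of Figure 2.13:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical 2D matrix, e.g. counts per (row, column) cell; values are illustrative
data = np.random.randint(0, 100, size=(6, 8))

sns.heatmap(data, cmap="YlOrRd")   # darker/warmer cells correspond to larger values
plt.title("Heatmap of a 2D matrix")
plt.show()
```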
Pairplot

Pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter
matrix consists of several pair-wise scatter plots of variables of the multivariate
data. All the results are presented in a matrix format. By visual examination of the
chart, one can easily find relationships among the variables such as correlation
between the variables.

A random matrix of three columns is chosen and the relationships of the columns
are plotted as a pairplot (or scatter matrix), as shown below in Figure 2.14.
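A small sketch of the same idea, assuming seaborn and pandas are available; the three-column matrix is randomly generated, as in the text:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# A random matrix of three columns
df = pd.DataFrame(np.random.randn(100, 3), columns=["A", "B", "C"])

sns.pairplot(df)   # pair-wise scatter plots of every column combination
plt.show()
```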
ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA
• Machine learning involves many mathematical concepts from the domains of
linear algebra, statistics, probability and information theory. The subsequent
sections discuss important aspects of linear algebra and probability.
• ‘Linear Algebra’ is a branch of mathematics that is central to many scientific
applications and other mathematical subjects. While all branches of
mathematics are crucial for machine learning, linear algebra plays a major
role as it is the mathematics of data. Linear algebra deals with linear
equations, vectors, matrices, vector spaces and transformations.
Linear Systems and Gaussian Elimination for Multivariate Data
• A linear system of equations is a group of equations with unknown variables.
Let Ax = y; then the solution x is given as x = A⁻¹y (or x = y/A in the scalar case).
• This is true if A is not zero (that is, A is invertible). The logic can be extended for an N-set of
equations with ‘n’ unknown variables.
• If there is a unique solution, then the system is called consistent independent. If
there are multiple solutions, then the system is called consistent dependent. If there
are no solutions and the equations are contradictory, then the system is called
inconsistent.
• For solving a large system of equations, Gaussian elimination can be
used. The procedure for applying Gaussian elimination is given as follows:
• To facilitate the application of the Gaussian elimination method, the following row
operations are applied (a small NumPy illustration follows this list):
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it
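A minimal NumPy sketch of solving a linear system; the 3×3 system below is a hypothetical example, and np.linalg.solve performs the elimination internally rather than spelling out the row operations:

```python
import numpy as np

# Hypothetical system Ax = y (illustrative values only)
A = np.array([[ 2.0,  1.0, -1.0],
              [-3.0, -1.0,  2.0],
              [-2.0,  1.0,  2.0]])
y = np.array([8.0, -11.0, -3.0])

x = np.linalg.solve(A, y)   # elimination-based solver
print(x)                    # [ 2.  3. -1.]
```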
Matrix Decompositions
• It is often necessary to reduce a matrix to its constituent parts so that complex
matrix operations can be performed. These methods are also known as matrix
factorization methods.
• The most popular matrix decomposition is called eigen decomposition. It is a
way of reducing the matrix into its eigen values and eigen vectors.
• Then, the matrix A can be decomposed as A = QΛQ⁻¹, where Q is the matrix of eigen vectors and Λ is the diagonal matrix of eigen values.
LU Decomposition
• One of the simplest matrix decompositions is LU decomposition, where the
matrix A can be decomposed into two matrices:
A = LU
• Here, L is the lower triangular matrix and U is the upper triangular matrix. The
decomposition can be done using the Gaussian elimination method discussed in
the previous section. First, an identity matrix is augmented to the given matrix.
Then, row operations and Gaussian elimination are applied to reduce the given
matrix to get the matrices L and U.
• Example 2.9 illustrates the application of Gaussian elimination to get LU.
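Example 2.9 itself is not reproduced here; as a hedged illustration of the same idea, SciPy's lu routine factors an arbitrary matrix into permutation, lower and upper triangular parts:

```python
import numpy as np
from scipy.linalg import lu

# Hypothetical 3x3 matrix (illustrative values only)
A = np.array([[4.0, 3.0, 2.0],
              [6.0, 3.0, 1.0],
              [8.0, 5.0, 7.0]])

P, L, U = lu(A)                     # A = P @ L @ U
print(np.allclose(A, P @ L @ U))    # True
```

Note that SciPy also returns a permutation matrix P, because rows may be swapped for numerical stability.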
Machine Learning and Importance of Probability and
Statistics
• Machine learning is linked with statistics and probability. Like linear algebra,
statistics is at the heart of machine learning. The importance of statistics needs to
be stressed, as without statistics, analysis of data is difficult.
• Probability is especially important for machine learning. Any data can be
assumed to be generated by a probability distribution. Machine learning
datasets have multiple data that are generated by multiple distributions. So,
knowledge of probability distributions and random variables is a must for better
understanding of machine learning concepts.
Probability Distributions
• A probability distribution of a variable, say X, summarizes the probability
associated with X's events. Distribution is a parameterized mathematical
function. In other words, distribution is a function that describes the
relationship between the observations in a sample space.
• Consider a set of data. The data is said to follow a distribution if it obeys a
mathematical function that characterizes that distribution. The function can be
used to calculate the probability of individual observations.
• Probability distributions are of two types:
• 1. Discrete probability distribution
• 2. Continuous probability distribution
• The relationship between the events of a continuous random variable and
their probabilities is called a continuous probability distribution. It is
summarized as a Probability Density Function (PDF).
• The PDF gives the probability density of observing an instance. The plot of the PDF shows
the shape of the distribution. The Cumulative Distribution Function (CDF) computes
the probability of an observation being less than or equal to a given value.
• Both PDF and CDF are continuous functions. The discrete equivalent of the PDF in a
discrete distribution is called the Probability Mass Function (PMF).
• The probability of a specific outcome cannot be read directly from the PDF; it must be computed
as the area under the curve for a small interval around that outcome.
The cumulative area up to a given value is the CDF.
• Let us discuss some of the distributions that are encountered in machine
learning.
• Continuous Probability Distributions: Normal, Rectangular, and Exponential
distributions fall under this category.
• 1. Normal Distribution – Normal distribution is a continuous probability
distribution. This is also known as the Gaussian distribution or bell-shaped curve
distribution. It is the most common distribution function. The shape of this
distribution is a typical bell-shaped curve. In a normal distribution, data tends to
be around a central value with no bias to the left or right. The heights of
students, the blood pressure of a population, and the marks scored in a class can be
approximated using a normal distribution. The PDF of the normal distribution is
given as:
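The formula image from the slide is not included; the standard normal PDF with mean μ and standard deviation σ is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$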
• Here, μ is the mean and σ is the standard deviation. The normal distribution is
characterized by two parameters: mean and variance.
• Mostly, one uses the normal distribution curve with mean 0 and SD 1. In a
normal distribution, the mean, median and mode are the same. The distribution
extends from −∞ to +∞.
• The standard deviation describes how the data is spread out.
• One important concept associated with normal distribution is z-score. It can be
computed as:
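The standard z-score formula (the slide's figure is omitted) is:

$$z = \frac{x - \mu}{\sigma}$$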
• Rectangular Distribution – This is also known as the uniform distribution. It has
equal probabilities for all values in the range [a, b]. The uniform distribution is
given as follows:
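The standard uniform PDF over [a, b] is:

$$f(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\[4pt] 0 & \text{otherwise} \end{cases}$$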
• Exponential Distribution – This is a continuous probability distribution used to
describe the time between events in a Poisson process. The exponential distribution
is a special case of the Gamma distribution with its shape parameter fixed at 1.
This distribution is helpful in modelling the time until an event occurs.
• The PDF is given as follows:
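The standard exponential PDF with rate parameter λ is:

$$f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$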
• Discrete Distributions: Binomial, Poisson, and Bernoulli distributions fall under
this category.
• 1. Binomial Distribution – Binomial distribution is another distribution that is
often encountered in machine learning. Each trial has only two outcomes: success or
failure; such a trial is also called a Bernoulli trial.
• The objective of this distribution is to find the probability of getting k successes
out of n trials. The number of ways of getting k successes out of n trials is given by
the binomial coefficient.
• The binomial distribution function is given as follows, where p is the
probability of success and the probability of failure is (1 − p).
• Combining both, one gets the PMF of the binomial distribution as:
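The standard combined form (the slide's figure is omitted), with the binomial coefficient counting the ways of choosing k successes out of n trials, is:

$$P(X = k) = \binom{n}{k} p^{k} (1-p)^{n-k}, \qquad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$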
• Poisson Distribution – It is another important distribution that is quite useful.
Given an interval of time, this distribution is used to model the probability of a
given number of events k occurring in that interval. The mean rate λ is independent of
the events that occurred before the interval. Some examples of Poisson distribution are
the number of emails received, the number of customers visiting a shop and the number
of phone calls received by an office.
• The PDF of Poisson distribution is given as follows:
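The standard Poisson PMF with mean rate λ is:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$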
• Bernoulli Distribution – This distribution models an experiment whose
outcome is binary. The outcome is positive with probability p and negative with probability 1 − p.
The PMF of this distribution is given as:
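The standard Bernoulli PMF is:

$$P(X = x) = p^{x}(1-p)^{1-x}, \qquad x \in \{0, 1\}$$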
Density Estimation
• Let there be a set of observed values x1, x2, ..., xn from a larger set of data
whose distribution is not known. Density estimation is the problem of
estimating the density function from the observed data.
• The estimated density function, denoted as p(x), can be evaluated directly
for any unknown data point x. If p(x) is less than a threshold ε, then x is
categorized as an outlier or anomaly; otherwise, it is treated as normal data.
• There are two types of density estimation methods, namely parametric density
estimation and non-parametric density estimation.
• Parametric Density Estimation – It assumes that the data is from a known
probabilistic distribution and the density can be estimated as p(x | θ), where θ is the
parameter. The maximum likelihood function is a parametric estimation method.
• Maximum Likelihood Estimation For a sample of observations, one can
estimate the probability distribution. This is called density estimation.
Maximum Likelihood Estimation (MLE) is a probabilistic framework that can be
used for density estimation.
• This involves formulating a function called the likelihood function, which is the
conditional probability of observing the given samples under the distribution
function with its parameters.
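In the usual notation (a standard form, not reproduced from the slide), the likelihood of parameters θ for independent samples x1, ..., xn and its maximizer are:

$$L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta), \qquad \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$$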
• Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm – In
machine learning, clustering is one of the important tasks. It is discussed in
Chapter 13. The MLE framework is quite useful for designing model-based
methods for clustering data. A model is a statistical method, and the data is
assumed to be generated by a distribution model with its parameter θ.
• There may be many distributions involved, and that is why it is called a
mixture model. Since Gaussians are normally assumed for the data, this mixture
model is categorized as a Gaussian Mixture Model (GMM).
• The EM algorithm is one algorithm that is commonly used for estimating the
MLE in the presence of latent or missing variables. What is a latent variable?
Let us assume that the dataset includes the weights of boys and girls. Considering
the fact that the boys’ weights would be slightly more than the weights of the
girls, one can assume that the larger weights are generated by one Gaussian
distribution with one set of parameters while the girls’ weights are generated with
another set of parameters. There is an influence of gender in the data, but it is
not directly present or observable. Such variables are called latent variables. The EM
algorithm is effective for estimating the PDF in the presence of latent variables.
• Generally, there can be many unspecified distributions with different sets of
parameters. The EM algorithm has two stages (a small fitting sketch follows this list):
• 1. Expectation (E) Stage — In this stage, the expected PDF and its parameters are
estimated for each latent variable.
• 2. Maximization (M) Stage – In this stage, the parameters are optimized using the MLE
function. This process is iterative, and the iteration is continued till all the latent
variables are fitted by probability distributions effectively along with their
parameters.
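A minimal fitting sketch, assuming scikit-learn is available; the two weight groups below are synthetic placeholders for the boys/girls example, not real data. GaussianMixture runs the EM algorithm internally:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic weights drawn from two latent groups (illustrative values only)
weights = np.concatenate([
    np.random.normal(45, 5, 200),   # e.g. girls' weights
    np.random.normal(60, 6, 200),   # e.g. boys' weights
]).reshape(-1, 1)

# Fit a two-component GMM; EM alternates E and M steps until convergence
gmm = GaussianMixture(n_components=2, random_state=0).fit(weights)
print(gmm.means_)     # estimated component means
print(gmm.weights_)   # estimated mixing proportions
```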
• Non-parametric Density Estimation – A non-parametric estimation can be
generative or discriminative. Parzen window is a generative estimation method
that finds p(x|θ) as a conditional density. Discriminative methods directly
compute p(θ|x) as a posterior probability. Parzen window and the k-Nearest
Neighbour (KNN) rule are examples of non-parametric density estimation. Let
us discuss them now.
• KNN Estimation The KNN estimation is another non-parametric density estimation
method. Here, the initial parameter k is determined and based on that k-neighbours
are determined. The probability density function estimate is the average of the
values that are returned by the neighbours.
2.10 FEATURE ENGINEERING AND DIMENSIONALITY
REDUCTION TECHNIQUES
• Features are attributes. Feature engineering is about determining the subset of
features that form an important part of the input that improves the performance of
the model, be it classification or any other model in machine learning.
• Feature engineering deals with two problems — Feature Transformation and Feature
Selection. Feature transformation is extraction of features and creating new features
that may be helpful in increasing performance. For example, the height and weight
may give a new attribute called Body Mass Index (BMI).
• Feature subset selection is another important aspect of feature engineering that focuses
on selection of features to reduce the time but not at the cost of reliability.
• The subset selection reduces the dataset size by removing irrelevant features and
constructs a minimum set of attributes for machine learning. If the dataset has n
attributes, then time complexity is extremely high as n dimensions need to be processed
for the given dataset.
• For n attributes, there are 2^n possible subsets. If the value of n is high, the problem
becomes intractable. This is called the ‘curse of dimensionality’: as the number of
dimensions increases, the time complexity increases. The remedy is that some of the
components that do not contribute much can be deleted. This results in the reduction of
dimensionality.
• Choosing optimal attributes becomes a graph search problem. Typically, the feature
subset selection problem uses a greedy approach, looking for the best choice at each step
using the locally optimal choice while hoping that it would lead to a globally optimal solution.
• The features can be removed based on two aspects:
• 1. Feature relevance - Some features contribute more to classification than
other features. For example, a mole on the face can help in face detection more than
common features like a nose. In simple words, the features should be relevant.
The relevance of the features can be determined based on information
measures such as mutual information, correlation-based measures like the
correlation coefficient, and distance measures. Distance measures are
discussed in Chapter 13 of this book.
• 2. Feature redundancy - Some features are redundant. For example, when a
database table has a field called date of birth, then the age field is redundant, as
age can be computed easily from the date of birth. Removing the age column
reduces the dimensionality by one.
• So, the procedure is:
• 1. Generate all possible subsets
• 2. Evaluate the subsets and model performance
• 3. Evaluate the results for optimal feature selection
• Filter-based selection uses statistical measures for assessing features. In this
approach, no learning algorithm is used. Correlation and information-gain
measures like mutual information and entropy are examples of this
approach (a small scikit-learn sketch follows).
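A small scikit-learn sketch of filter-based selection, assuming the library is available; the Iris dataset and the choice of k = 2 are placeholders for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature with mutual information and keep the two highest-scoring ones;
# no learning algorithm is involved in the scoring itself
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)   # per-feature relevance scores
print(X_reduced.shape)    # (150, 2)
```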
• Wrapper-based methods use classifiers to identify the best features. These are
selected and evaluated by the learning algorithms. This procedure is
computationally intensive but has superior performance.
• Stepwise Forward Selection
• This procedure starts with an empty set of attributes. Every time, an attribute
is tested for statistical significance for best quality and is added to the reduced
set. This process is continued till a good reduced set of attributes is obtained.
• 2.10.2 Stepwise Backward Elimination
• This procedure starts with a complete set of attributes. At every stage, the
procedure removes the worst attribute from the set, leading to the reduced
set.
• Combined Approach Both forward and reverse methods can be combined so
that the procedure can add the best attribute and remove the worst attribute.
2.10.3 Principal Component Analysis
• The idea of the principal component analysis (PCA) or KL transform is to
transform a given set of measurements to a new set of features so that the
features exhibit high information packing properties. This leads to a reduced
and compact set of features. Basically, this elimination is made possible
because of the information redundancies. This compact representation is of a
reduced dimension.
• The advantages of PCA are immense. It reduces the attribute list by eliminating
all irrelevant attributes. The PCA algorithm is as follows (a NumPy sketch of these steps appears after the list):
• 1. The target dataset x is obtained.
• 2. The mean is subtracted from the dataset. Let the mean be m. Thus, the
adjusted dataset is x − m. The objective of this process is to transform the dataset to zero
mean.
• 3. The covariance of dataset x is obtained. Let it be C.
• 4. The eigen values and eigen vectors of the covariance matrix are calculated.
• 5. The eigen vector of the highest eigen value is the principal component of
the dataset. The eigen values are arranged in descending order. The feature
vector is formed with these eigen vectors as its columns: Feature vector =
{eigen vector1, eigen vector2, ..., eigen vectorn}.
• 6. Obtain the transpose of the feature vector. Let it be A.
• 7. The PCA transform is y = A × (x − m), where x is the input dataset, m is the mean,
and A is the transpose of the feature vector.
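A NumPy sketch of these steps under stated assumptions: the 5×3 dataset is made up for illustration, and all eigen vectors are kept (in practice only the top few are retained):

```python
import numpy as np

# Hypothetical dataset: 5 samples, 3 attributes (illustrative values only)
x = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

m = x.mean(axis=0)                       # step 2: mean vector
x_adj = x - m                            # zero-mean dataset
C = np.cov(x_adj, rowvar=False)          # step 3: covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(C)   # step 4: eigen values and eigen vectors

order = np.argsort(eig_vals)[::-1]       # step 5: sort eigen values in descending order
A = eig_vecs[:, order].T                 # step 6: transpose of the feature vector
y = (A @ x_adj.T).T                      # step 7: PCA transform y = A (x - m)
print(y)
```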
• The process of acquiring knowledge and expertise through study, experience,
or being taught is called learning. Generally, humans learn in different ways.
To make machines learn, we need to simulate the strategies of human learning
in machines. But can computers learn? This question has been raised over
many centuries by philosophers, mathematicians and logicians. First let us
address the question - what sort of tasks can computers learn? This
depends on the nature of problems that computers can solve. There are
two kinds of problems - well-posed and ill-posed. Computers can solve only
well-posed problems, as these have well-defined specifications and have the
following components inherent to them:
• A Class of learning tasks (T)
• A measure of performance (P)
• A source of experience (E)
• The standard definition of learning proposed by Tom Mitchell is that a
program is said to learn from experience E with respect to a class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E.
• Let us formalize the concept of learning as follows: let x be the input and X be
the input space, which is the set of all inputs, and let Y be the output space, which
is the set of all possible outputs, that is, yes/no.
• Let D be the input dataset with examples (x1, y1), (x2, y2), ..., (xn, yn) for n
inputs. Let the unknown target function be f: X → Y, which maps the input space to the
output space. The objective of the learning program is to pick a function g: X
→ Y to approximate the target function f.
• All the possible formulae form a hypothesis space. In short, let H be the set of
all formulae from which the learning algorithm chooses. The choice is good
when the hypothesis g replicates f for all samples. This is shown in Figure 3.1.
Classical and Adaptive Machine Learning Systems
• A classical machine learning system has components such as Input, Process and
Output. The input values are taken from the environment directly. These values are
processed and a hypothesis is generated as output model. This model is then used
for making predictions. The predicted values are consumed by the environment.
• In contrast to the classical systems, adaptive systems interact with the input for
getting labelled data as direct inputs are not available. This process is called
reinforcement learning.
• In reinforcement learning, a learning agent interacts with the environment and in
return gets feedback. Based on the feedback, the learning agent generates input
samples for learning, which are used for generating the learning model. Such
learning agents are not static and change their behaviour according to the external
signal received from the environment.
Learning Types
• There are different types of learning. Some of the different learning methods
are as follows:
• Learn by memorization, or learn by repetition, also called rote learning, is
done by memorizing without understanding the logic or concept. Although
rote learning is basically learning by repetition, from a machine learning
perspective, the learning occurs by simply comparing with the existing
knowledge for the same input data and producing the output if present.
• Learn by examples, also called learn by experience or from previously acquired
knowledge, is like finding an analogy; it means performing
inductive learning from observations that formulate a general concept. Here,
the learner learns by inferring a general rule from the set of observations or
examples. Therefore, inductive learning is also called discovery learning.
• Learn by being taught by an expert or a teacher is generally called passive learning. However,
there is a special kind of learning called active learning, where the learner can interactively query a
teacher/expert to label unlabelled data instances with the desired outputs.
• Learning by critical thinking, also called deductive learning, deduces new facts or conclusions
from related known facts and information.
• Self-learning, also called reinforcement learning, is a self-directed learning that normally learns from
mistakes through punishments and rewards.
• Learning to solve problems is a type of cognitive learning where learning happens in the
mind and is possible by devising a methodology to achieve a goal. Here, the learner initially is
not aware of the solution or the way to achieve the goal but knows only the goal. The learning
happens either directly from the initial state by following the steps to achieve the goal or
indirectly by inferring the behaviour.
• Learning by generalizing explanations, also called explanation-based learning (EBL), is another
learning method that exploits domain knowledge from experts to improve the accuracy of
concepts learned by supervised learning.
DESIGN OF A LEARNING SYSTEM
• A system that is built around a learning algorithm is called a learning system.
• The design of a learning system focuses on these steps:
• Choosing a training experience
• Choosing a target function
• Representation of a target function
• Function approximation
Training Experience
• Let us consider the design of a chess game. In direct experience, individual
board states and the correct moves of the chess game are given directly. In an indirect
system, only the move sequences and the results are given.
• The training experience also depends on the presence of a supervisor who can
label all valid moves for a board state. In the absence of a supervisor, the game
agent plays against itself and learns the good moves, provided the training samples cover
all scenarios, or in other words, are distributed enough for performance computation.
If the training samples and testing samples have the same distribution, the results
would be good.
Determine the Target Function
• The next step is the determination of a target function. In this step, the type of
knowledge that needs to be learnt is determined. In direct experience, a board
move is selected and is determined whether it is a good move or not against all
other moves. If it is the best move, then it is chosen as: B —> M, where, B and M
are legal moves.
• In indirect experience, all legal moves are accepted and score is generated for
each. The move with. largest score is then chosen and executed.
• INTRODUCTION TO CONCEPT LEARNING
• Concept learning is a learning strategy of acquiring abstract knowledge or
inferring a general concept or deriving a category from the given training
samples. It is a process of abstraction and generalization from the data.
Concept learning helps to classify an object that has a set of common, relevant
features.
• The learner tries to simplify by observing the common features from the
training samples and then applies this simplified model to the future samples.
This task is also known as learning from experience. Each concept or category
obtained by learning is a Boolean-valued function which takes a true or false
value. For example, humans can identify different kinds of animals based on
common relevant features and categorize all animals based on specific sets of
features.
• The special features that distinguish one animal from another can be called a
concept. This way of learning categories for objects and recognizing new
instances of those categories is called concept learning, formally defined as
inferring a Boolean-valued function by processing training instances. Concept
learning requires three things:
• Input — Training dataset which is a set of training instances, each labeled with the
name of concept or category to which it belongs. Use this past experience to train
and build the model.
• Output - Target concept or target function f. It is a mapping function f(x) from
input x to output y. It is used to determine the specific or common features needed to
identify an object. In other words, it is to find the hypothesis that determines the
target concept. For example, the specific set of features to identify an elephant from all
animals.
• Test — New instances to test the learned model.
• Formally, Concept learning is defined as—"Given a set of hypotheses, the learner searches
through the hypothesis space to identify the best hypothesis that matches the target
concept”. Consider the following set of training instances shown in Table 3.1.
• Here, in this set of training instances, the independent attributes considered are ‘Horns’,
‘Tail’, ‘Tusks’, ‘Paws’, ‘Fur’, ‘Color’, ‘Hooves’ and ‘Size’. The dependent attribute is ‘Elephant’.
• The target concept is to identify the animal to be an Elephant. Let us now take this example
and understand further the concept of a hypothesis. Target concept: predict the type of
animal - for example, Elephant.
Hypothesis Space
• Hypothesis space is the set of all possible hypotheses that approximates the
target function f.
• In other words, the set of all possible approximations of the target function
can be defined as hypothesis space. From this set of hypotheses in the
hypothesis space, a machine learning algorithm would determine the best
possible hypothesis that would best describe the target function or best fit the
outputs.
• Generally, a hypothesis representation language represents a larger hypothesis
space. Every machine learning algorithm would represent the hypothesis space
in a different manner about the function that maps the input variables to
output variables.
• For example, a regression algorithm represents the hypothesis space as a
linear function whereas a decision tree algorithm represents the hypothesis
space as a tree.
• The set of hypotheses that can be generated by a learning algorithm can be
further reduced by specifying a language bias.
• The subset of the hypothesis space that is consistent with all observed training
instances is called the Version Space. The version space represents the only
hypotheses that are used for the classification.
• For example, each of the attributes given in Table 3.1 has the following
possible set of values.
• Considering these values for each of the attributes, there are (2x2x2x2x2x3x2x2) = 384
distinct instances covering all the 5 instances in the training dataset.
• So, we can generate (4x4x4x4x4x5x4x4) = 81,920 distinct hypotheses when
including two more values [?, Φ] for each attribute. However, any
hypothesis containing one or more Φ symbols represents the empty set of
instances; that is, it classifies every instance as a negative instance.
• Therefore, there will be (3x3x3x3x3x4x3x3) + 1 = 8,749 distinct hypotheses by
including only ‘?’ for each attribute and one hypothesis representing the
empty set of instances. Thus, the hypothesis space is much larger, and hence
we need efficient learning algorithms to search for the best hypothesis from
the set of hypotheses.
• Hypothesis ordering is also important wherein the hypotheses are ordered
from the most specific one to the most general one in order to restrict
searching the hypothesis space exhaustively.
Heuristic Space Search
• Heuristic search is a search strategy that finds an optimized
hypothesis/solution to a problem by iteratively improving the
hypothesis/solution based on a given heuristic function or a cost measure.
• Heuristic search methods will generate a possible hypothesis that can be a
solution in the hypothesis space, or a path from the initial state. This
hypothesis will be tested against the target function or goal state to see if it is a real solution.
• Several commonly used heuristic search methods are hill climbing,
constraint satisfaction, best-first search, simulated annealing, the A*
algorithm, and genetic algorithms.
Generalization and Specialization
• In order to understand how we construct this concept hierarchy, let us
apply the general principle of the generalization/specialization relation. By
generalization of the most specific hypothesis and by specialization of the most
general hypothesis, the hypothesis space can be searched for an approximate
hypothesis that matches all positive instances but does not match any negative
instance.
Searching the Hypothesis Space
• There are two ways of learning a hypothesis consistent with all training
instances from the large hypothesis space:
• 1. Specialization - General to Specific learning
• 2. Generalization - Specific to General learning
• Generalization – Specific to General Learning This learning methodology will
search through the hypothesis space for an approximate hypothesis by
generalizing the most specific hypothesis.
• Specialization - General to Specific Learning This learning methodology will
search through the hypothesis space for an approximate hypothesis by
specializing the most general hypothesis.
Hypothesis Space Search by Find-S Algorithm
• Find-S algorithm is guaranteed to converge to the most specific hypothesis in H
that is consistent with the positive instances in the training dataset. Obviously,
it will also be consistent with the negative instances.
• Thus, this algorithm considers only the positive instances and eliminates
negative instances while generating the hypothesis. It initially starts with the
most specific hypothesis.
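A minimal Find-S sketch; the attribute values and training examples below are hypothetical, not the book's Table 3.1. '?' stands for any value and 'phi' for the most specific (empty) value:

```python
def find_s(training_examples):
    # Return the most specific hypothesis consistent with the positive examples
    hypothesis = ['phi'] * len(training_examples[0][0])   # most specific hypothesis
    for attributes, label in training_examples:
        if label != 'yes':                 # negative instances are ignored
            continue
        for i, value in enumerate(attributes):
            if hypothesis[i] == 'phi':
                hypothesis[i] = value      # adopt the first positive example
            elif hypothesis[i] != value:
                hypothesis[i] = '?'        # generalize where examples disagree
    return hypothesis

examples = [
    (['yes', 'long', 'yes', 'big'], 'yes'),
    (['yes', 'short', 'yes', 'big'], 'yes'),
    (['no', 'long', 'no', 'small'], 'no'),
]
print(find_s(examples))   # ['yes', '?', 'yes', 'big']
```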
Limitations of Find-S Algorithm
• 1. Find-S algorithm tries to find a hypothesis that is consistent with positive
instances, ignoring all negative instances. As long as the training dataset is
consistent, the hypothesis found by this algorithm may be consistent.
• 2. The algorithm finds only one unique hypothesis, wherein there may be
many other hypotheses that are consistent with the training dataset
• 3.Many times, the training dataset may contain some errors; hence such
inconsistent data may mislead this algorithm in determining consistent
hypothesis.
Version Spaces
• The version space contains the subset of hypotheses from the hypothesis
space that is consistent with all training instances in the training dataset.
List-Then-Eliminate Algorithm
• The principal idea of this learning algorithm is to initialize the version space to
contain all hypotheses and then eliminate any hypothesis that is found
inconsistent with any training instance. Initially, the algorithm starts with a
version space containing all hypotheses and scans each training instance.
• The hypotheses that are inconsistent with the training instance are eliminated.
Finally, the algorithm outputs the list of remaining hypotheses that are all
consistent.
Version Spaces and the Candidate Elimination Algorithm
• Version space learning generates all consistent hypotheses. This
algorithm computes the version space by combining two cases,
namely,
• * Specific to General learning — Generalize S to include the positive example
• * General to Specific learning — Specialize G to exclude the negative example
• Using the Candidate Elimination algorithm, we can compute the version space
containing all (and only those) hypotheses from H that are consistent with the
given observed sequence of training instances.
• The algorithm defines two boundaries: the ‘general boundary’, which is the set
of all hypotheses that are the most general, and the ‘specific boundary’, which is the
set of all hypotheses that are the most specific. Thus, the algorithm limits the
version space to contain only those hypotheses that are most general and
most specific, and it provides a compact representation of the List-Then-Eliminate
algorithm.
INDUCTION BIASES
• Induction is a process of learning a target function or generalizing training data
into a general model.
• Inductive bias is the set of prior assumptions considered by a learning algorithm
beyond the training data, in order to perform induction. It is also called as the
bias of the algorithm.
• It is similar to prior knowledge used for learning new concepts. For example,
linear regression learning assumes that the predictors (independent variables)
are linearly related to the target variable. Similarly, the C4.5 decision tree learning
algorithm greedily chooses the best attribute as the split criterion when
constructing the decision tree.
There are two types of bias:
– 1. Constraint or Restriction — limit the hypothesis space
– 2. Preference — impose ordering on hypothesis space
• Bias and Variance
• In supervised machine learning models, we need to find the target function f(x) that
best maps the input values to the actual output values ‘y’. If a
predicted output value deviates from the actual output value, then we call it an error.
There are three kinds of prediction errors: bias error, variance error and irreducible
error.
• 1. Irreducible errors cannot be reduced or avoided; they normally happen because of
various factors like unknown variables, noise, etc. These errors can sometimes be
mitigated by data cleaning.
• 2. A Bias error is the difference between the predicted output value and the actual
output value of any learning model. Bias defines the accuracy of model predictions or
the Mean Square Error (MSE) in the predictions. This bias error occurs due to inaccurate
and simplifying assumptions or under-fitting during learning with the training data. The
error is prominent when the test data is provided to the learned model.
• 3. A Variance error is the change in the estimated target function caused by sensitivity
to outliers, noise or small fluctuations in the training dataset. This change in the estimated
target function occurs due to outliers, noise, and when the number and types
of parameters used to derive the mapping function change with different
training sets. On the other hand, variance is the variation or spread from the
actual value that can be seen between the predictions of many models estimated
from different training sets. A pictorial representation of bias and variance is
shown in Figure 3.3.
Bias vs Variance Tradeoff
Best Fit in Machine Learning
• Generalizing the hypothesis from training instances to a specific model is called
inductive learning.
• Generalization basically describes the model's ability to infer or predict correctly
on new unseen data after being trained with a training dataset. If a model is
over-trained or under-trained, then its predictions are not going to be accurate.
• Generally, the performance of a machine learning model or its predictions deteriorates
when it learns too much or too little from the training instances. When a machine
learning model learns too much from the training instances, including noise,
overfitting occurs. In other words, the model tries to fit the training data
instances too well, or the predictors/features/independent variables are too complex.
• Moreover, when a model performs very well on the training data but poorly on
the test data, it is also an overfitting problem.
• Overfitting occurs more likely with non-parametric and non-linear models that
have more flexibility when learning a target function.
• Underfitting generally occurs when a machine learning model could not learn
from the training instances or the instances do not match with the model to
learn or when predictors are very simple. In other words, the model does not
fit the data well enough.
• Ideally, the goal of selecting a machine learning model is to provide a
performance in predictions between underfitting and overfitting. Hence, it is
essential to find a model that provides a good fit but practically it is very
difficult to achieve.
• Model Selection and Model Evaluation
• The biggest challenge in machine learning is choosing an algorithm that suits
the problem. Hence, model selection and assessment are very important and
deal with two types of complexities.
• 1. Model Performance - How well does the model perform on the training dataset?
• 2. Model Complexity - How much complexity does the model possess after the
training phase is over?
• Model Selection is the process of selecting one good enough model among
different machine learning models for the dataset, or selecting different sets of
features or hyperparameters for the same machine learning model. It is
difficult to find the best model because all models exhibit some predictive error
for the problem, so at least a good enough model that performs fairly well with
the dataset should be selected.
• Some of the approaches used for selecting a machine learning model are listed
below:
• 1. Use resample methods and split the dataset as training, testing and
validation datasets and observe the performance of the model over all the
phases. This approach is suitable for smaller datasets.
• 2. The simplest approach is to fit a model on the training dataset and to
compute measures like error or accuracy.
• 3. The use of a probabilistic framework and quantification of the performance of
the model as a score is the third approach.
• These methods are discussed in the following sections.

More Related Content

PDF
Unit-3 Data Analytics.pdf
PDF
Linear Algebra – A Powerful Tool for Data Science
PDF
Dr. Shivu___Machine Learning_Module 2pdf
PDF
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
PDF
The normal presentation about linear regression in machine learning
PPTX
AI & ML(Unit III).pptx.It contains also syllabus
PPTX
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
PPTX
DA//////////////////////////////////////// Unit 2.pptx
Unit-3 Data Analytics.pdf
Linear Algebra – A Powerful Tool for Data Science
Dr. Shivu___Machine Learning_Module 2pdf
Lecture 4 - Linear Regression, a lecture in subject module Statistical & Mach...
The normal presentation about linear regression in machine learning
AI & ML(Unit III).pptx.It contains also syllabus
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
DA//////////////////////////////////////// Unit 2.pptx

Similar to MODULE-3edited.pptx machine learning modulk (20)

PDF
Data_Analytics_for_IoT_Solutions.pptx.pdf
PPTX
collectionandrepresentationofdata1-200904192336.pptx
PDF
KIT-601 Lecture Notes-UNIT-2.pdf
PPTX
Model Evaluation & Visualisation part of a series of intro modules for data ...
DOCX
Statistics digital text book
PPTX
numerical method chapter 5.pptxggggvvgggbg
PDF
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
PDF
Chapter_5 Fundamentals of statisticsl.pdf
PPTX
3.1 Measures of center
PDF
A Novel Algorithm for Design Tree Classification with PCA
PDF
1376846406 14447221
PDF
IT-601 Lecture Notes-UNIT-2.pdf Data Analysis
PPTX
03 Data Mining Techniques
PDF
Principal component analysis and lda
DOCX
Mc0079 computer based optimization methods--phpapp02
PDF
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
PPTX
Chapter Three Univarite Anaalysis.pptx
PPTX
Data Representations
PPTX
Discriminant analysis.pptx
PDF
Regression Analysis-Machine Learning -Different Types
Data_Analytics_for_IoT_Solutions.pptx.pdf
collectionandrepresentationofdata1-200904192336.pptx
KIT-601 Lecture Notes-UNIT-2.pdf
Model Evaluation & Visualisation part of a series of intro modules for data ...
Statistics digital text book
numerical method chapter 5.pptxggggvvgggbg
MSC III_Research Methodology and Statistics_Descriptive statistics.pdf
Chapter_5 Fundamentals of statisticsl.pdf
3.1 Measures of center
A Novel Algorithm for Design Tree Classification with PCA
1376846406 14447221
IT-601 Lecture Notes-UNIT-2.pdf Data Analysis
03 Data Mining Techniques
Principal component analysis and lda
Mc0079 computer based optimization methods--phpapp02
Deepak_DAI101_Data_Anal_lecture6 (1).pdf
Chapter Three Univarite Anaalysis.pptx
Data Representations
Discriminant analysis.pptx
Regression Analysis-Machine Learning -Different Types
Ad

Recently uploaded (20)

PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
DOCX
573137875-Attendance-Management-System-original
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
Welding lecture in detail for understanding
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Sustainable Sites - Green Building Construction
PDF
Well-logging-methods_new................
PDF
PPT on Performance Review to get promotions
PPT
Mechanical Engineering MATERIALS Selection
PPTX
web development for engineering and engineering
PPTX
additive manufacturing of ss316l using mig welding
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPT
Project quality management in manufacturing
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
R24 SURVEYING LAB MANUAL for civil enggi
CYBER-CRIMES AND SECURITY A guide to understanding
573137875-Attendance-Management-System-original
Embodied AI: Ushering in the Next Era of Intelligent Systems
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Welding lecture in detail for understanding
Foundation to blockchain - A guide to Blockchain Tech
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Sustainable Sites - Green Building Construction
Well-logging-methods_new................
PPT on Performance Review to get promotions
Mechanical Engineering MATERIALS Selection
web development for engineering and engineering
additive manufacturing of ss316l using mig welding
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
Project quality management in manufacturing
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Ad

MODULE-3edited.pptx machine learning modulk

  • 1. MODULE-2 • Understanding Data – 2: Bivariate Data and Multivariate Data, Multivariate Statistics, Essential Mathematics for Multivariate Data, Feature Engineering and Dimensionality Reduction Techniques. • Basic Learning Theory: Design of Learning System, Introduction to Concept of Learning, Modelling in Machine Learning.
  • 2. BIVARIATE DATA AND MULTIVARIATE DATA  Bivariate Data involves two variables. Bivariate data deals with causes of relationships. The aim is to find relationships among data. Consider the following Table 2.3, with data of the temperature in a shop and sales of sweaters.
  • 3.  Here, the aim of bivariate analysis is to find relationships among variables. The relationships can then be used in comparisons, finding causes, and in further explorations. To do that, graphical display of the data is necessary. One such graph method is called scatter plot.  Scatter plot is used to visualize bivariate data. It is useful to plot two variables with or without nominal variables, to illustrate the trends, and also to show differences. It is a plot between explanator and response variables. It is a 2D graph showing the relationship between two variables.  The scatter plot (Refer Figure 2.11) indicates strength, shape, direction and the presence of Outliers. It is useful in exploratory data before calculating a correlation coefficient or fitting regression curve.
  • 6. 2.6.1 Bivariate Statistics Covariance and correlation re the examples of bivariate statistics. Covariance is a measure of joint probability of random variable Generally, random variables are represented in capital letters. !t is defined as covariance(X, Y) or COV(X, Y) and is used to measure the variance between two dimensions. The formula for finding co-variance for specific x, and y are: Here, x, and y, are data values from X and Y. E(X) and E(Y) are the mean values of x, and N is the number of given data. Also, the COV(X, Y) in same as COV(Y, X)
  • 7. The covariance between X and Y is 12. It can be normalized to a value between —1 and +1. This is done by dividing it by the correlation variable and This is called Pearson correlation coefficient. sometimes, N — 1 is also can be used instead of N. In that case, the covariance is 60/4 = 15. Correlation The Pearson correlation coefficient is the most common test for determining any association between two phenomena. It measures the strength and direction of a linear relationship between the x and y variables. The correlation indicates the relationship between dimensions using its sign. The sign is more important than the actual value. 1. If the value is positive, it indicates that the dimensions increase together.
  • 8.  2. Ifthe value is negative, it indicates that while one-dimension increases, the other dimension decreases.  3. If the value is zero, then it indicates that both the dimensions are independent of each other.  If the dimensions are correlated, then it is better to remove one dimension as it is a redundant dimension.  If the given attributes are X= (x1,x2...xn) and Y=(y1,y2,y3...yn), then the Pearson correlation coefficient, that is denoted as , is given as:
  • 9. 2.7 MULTIVARIATE STATISTICS In machine learning, almost all datasets are multivariable. Multivariate data is the analysis of more than two observable variables, and often, thousands of multiple measurements need to be conducted for one or more subjects. The multivariate data is like bivariate data but may have more than two dependant variables. Some of the multivariate analysis are regression analysis, principal component analysis, and path analysis.
  • 10.  The mean of multivariate data is a mean vector and the mean of the above three attributes is given as (2, 7.5, 1.33). The variance of multivariate data becomes the covariance matrix.  The mean vector is called centroid and variance is called dispersion matrix. This is discussed in the next section.  Multivariate data has three or more variables. The aim of the multivariate analysis is much more. They are regression analysis, factor analysis and multivariate analysis of variance that are explained in the subsequent chapters of this book.
  • 11. Heatmap  Heatmap is a graphical representation of 2D matrix. It takes a matrix as input and colours it. The darker colours indicate very large values and lighter colours indicate smaller values.  The advantage of this method is that humans perceive colours well. So, by colour shaping, larger values can be perceived well.  For example, in vehicle traffic data, heavy traffic regions can be differentiated from low traffic regions through heatmap.  In Figure 2.13, patient data highlighting weight and health status is plotted. Here, X-axis is weights and Y-axis is patient counts. The dark colour regions highlight patients’ weights vs patient counts in health status.
  • 13. Pairplot  Pairplot or scatter matrix is a data visual technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of variables of the multivariate data. All the results are presented in a matrix format. By visual examination of the chart, one can easily find relationships among the variables such as correlation between the variables.  A random matrix of three columns is chosen and the relationships of the columns is plotted as a pairplot (or scattermatrix) as shown below in Figure 2.14.
  • 14. ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA • Machine learning involves many mathematical concepts from the domain of Linear algebra, Statistics, Probability and Information theory. The subsequent sections-discuss important aspects of linear algebra and probability. • ‘Linear Algebra’ is a branch of mathematics that is central for many scientific applications and other mathematical subjects. While all branches of mathematics are crucial for machine learning, linear algebra plays a major large role as it is the mathematics of data. Linear algebra deals with linear equations, vectors, matrices, vector spaces and transformations. Linear Systems and Gaussian Elimination for Multivariate Data • A linear system of equations is a group of equations with unknown variables. Let Ax = y, then the solution x is given as:
  • 15. • This is true if y is not zero and A is not zero. The logic can be extended for N-set of equations with ‘n’ unknown variables. • If there is a unique solution, then the system is called consistent independent. If there are various solutions, then the system is called consistent dependant. If there are no solutions and if the equations are contradictory, then the system is called inconsistent.
  • 16. • For solving large number of system of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is given as follows:
  • 17. • To facilitate the application of Gaussian elimination method, the following row operations are applied: 1. Swapping the rows 2. Multiplying or dividing a row by a constant 3. Replacing a row by adding or subtracting a multiple of another row to it Matrix Decompositions • It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations can be performed. These methods are also known as matrix factorization methods. • The most popular matrix decomposition is called eigen decomposition. It is a way of reducing the matrix into eigen values and eigen vectors. • Then, the matrix A can be decomposed as
• 18. LU Decomposition • One of the simplest matrix decompositions is LU decomposition, where the matrix A can be decomposed into two matrices: A = LU • Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be done using the Gaussian elimination method as discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix to obtain the matrices L and U. • Example 2.9 illustrates the application of Gaussian elimination to get LU; a library-based sketch follows.
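A library-based sketch of LU decomposition using scipy.linalg.lu; note that SciPy also returns a permutation matrix P (from row swaps), so the factorization is A = PLU. The matrix values are illustrative.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

# scipy returns a permutation matrix P as well, so A = P @ L @ U
P, L, U = lu(A)
print("L =\n", L)
print("U =\n", U)
print("P @ L @ U equals A:", np.allclose(P @ L @ U, A))
```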
• 21. Machine Learning and Importance of Probability and Statistics • Machine learning is closely linked with statistics and probability. Like linear algebra, statistics is at the heart of machine learning. The importance of statistics needs to be stressed, as without statistics, analysis of data is difficult. • Probability is especially important for machine learning. Any data can be assumed to be generated by a probability distribution. Machine learning datasets contain data generated by multiple distributions, so a knowledge of probability distributions and random variables is a must for a better understanding of machine learning concepts.
  • 22. Probability Distributions • A probability distribution of a variable, say X, summarizes the probability associated with X's events. Distribution is a parameterized mathematical function. In other words, distribution is a function that describes the relationship between the observations in a sample space. • Consider a set of data. The data is said to follow a distribution if it obeys a mathematical function that characterizes that distribution. The function can be used to calculate the probability of individual observations. • Probability distributions are of two types: • 1. Discrete probability distribution • 2. Continuous probability distribution
• 23. • The relationship between the events of a continuous random variable and their probabilities is called a continuous probability distribution. It is summarized by a Probability Density Function (PDF). • The PDF describes the likelihood of observing an instance, and the plot of the PDF shows the shape of the distribution. The Cumulative Distribution Function (CDF) computes the probability that an observation is less than or equal to a given value. • Both PDF and CDF are defined over continuous values. The discrete equivalent of the PDF for a discrete distribution is called the Probability Mass Function (PMF). • For a continuous variable, the probability of an exact outcome cannot be read off directly; it is computed as the area under the PDF over a small interval around the specific outcome. • Let us discuss some of the distributions that are encountered in machine learning.
• 24. • Continuous Probability Distributions Normal, rectangular (uniform), and exponential distributions fall under this category. • 1. Normal Distribution — Normal distribution is a continuous probability distribution. It is also known as the Gaussian distribution or bell-shaped curve distribution, and it is the most common distribution function. The shape of this distribution is a typical bell-shaped curve. In a normal distribution, data tends to be around a central value with no bias to the left or right. The heights of students, the blood pressure of a population, and the marks scored in a class can be approximated using a normal distribution. The PDF of the normal distribution is given as: f(x) = (1 / (σ√(2π))) exp(−(x − m)² / (2σ²))
• 25. • Here, m is the mean and σ is the standard deviation. The normal distribution is characterized by two parameters, mean and variance. • Mostly, one uses the normal distribution curve with a mean of 0 and an SD of 1. In a normal distribution, the mean, median and mode are the same. The distribution extends from −∞ to +∞. • Standard deviation describes how spread out the data is. • One important concept associated with the normal distribution is the z-score. It can be computed as z = (x − m) / σ.
• 26. • Rectangular Distribution — This is also known as the uniform distribution. It assigns equal probability to all values in the range [a, b]. The uniform distribution is given as f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise. • Exponential Distribution — This is another continuous distribution. It is used to describe the time between events in a Poisson process. The exponential distribution is a special case of the Gamma distribution with the shape parameter fixed at 1. This distribution is helpful in modelling the time until an event occurs. • The PDF is given as f(x) = λe^(−λx) for x ≥ 0, where λ is the rate parameter. A scipy-based sketch of these continuous distributions follows.
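A scipy.stats sketch evaluating the PDFs of these continuous distributions at an arbitrary point, together with a z-score; the parameter values are made up purely for illustration.

```python
from scipy import stats

x = 1.5

# Normal distribution with mean m = 0 and standard deviation sigma = 1
print("Normal      PDF at x:", stats.norm(loc=0, scale=1).pdf(x))
print("z-score of x=1.5 for m=1, sigma=0.5:", (1.5 - 1.0) / 0.5)

# Uniform (rectangular) distribution over [a, b] = [0, 2]
print("Uniform     PDF at x:", stats.uniform(loc=0, scale=2).pdf(x))  # 1/(b-a) = 0.5

# Exponential distribution with rate lambda = 2 (scipy uses scale = 1/lambda)
print("Exponential PDF at x:", stats.expon(scale=1/2).pdf(x))         # 2*exp(-2*1.5)
```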
• 27. • Discrete Distributions Binomial, Poisson, and Bernoulli distributions fall under this category. • 1. Binomial Distribution — The binomial distribution is another distribution that is often encountered in machine learning. Each trial has only two outcomes: success or failure. Such a trial is also called a Bernoulli trial. • The objective of this distribution is to find the probability of getting k successes out of n trials. The number of ways to get k successes out of n trials is the binomial coefficient C(n, k) = n! / (k!(n − k)!). • The binomial distribution function is given as follows, where p is the probability of success and the probability of failure is (1 − p). The probability of one particular sequence with k successes in n trials is p^k (1 − p)^(n − k).
• 28. • Combining both, one gets the PMF of the binomial distribution as: P(X = k) = C(n, k) p^k (1 − p)^(n − k)
• 29. • Poisson Distribution — This is another important and quite useful distribution. Given an interval of time, it models the probability of observing a given number of events k, where events occur independently at a constant mean rate λ. Some examples of Poisson-distributed quantities are the number of emails received, the number of customers visiting a shop and the number of phone calls received by an office. • The PMF of the Poisson distribution is P(X = k) = (λ^k e^(−λ)) / k!. • Bernoulli Distribution — This distribution models an experiment whose outcome is binary. The outcome is positive with probability p and negative with probability 1 − p. The PMF of this distribution is P(X = x) = p^x (1 − p)^(1 − x) for x ∈ {0, 1}. A scipy sketch of these discrete distributions follows.
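A scipy.stats sketch evaluating the PMFs of the binomial, Poisson and Bernoulli distributions; the parameters n, p, k and λ are arbitrary illustrative values.

```python
from scipy import stats

n, p, k = 10, 0.3, 4

# Binomial: probability of exactly k successes in n Bernoulli trials
print("Binomial  P(X=4):", stats.binom(n, p).pmf(k))

# Poisson: probability of k events in an interval with mean rate lam
lam = 3.0
print("Poisson   P(X=4):", stats.poisson(lam).pmf(k))

# Bernoulli: a single trial with success probability p
print("Bernoulli P(X=1):", stats.bernoulli(p).pmf(1))
```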
• 30. Density Estimation • Let there be a set of observed values x1, x2, …, xn from a larger set of data whose distribution is not known. Density estimation is the problem of estimating the density function from the observed data. • The estimated density function, denoted as p(x), can be evaluated directly for any new data point x. If p(x) is greater than a threshold ε, then x is treated as normal data; otherwise, it is categorized as anomaly data. • There are two types of density estimation methods, namely parametric density estimation and non-parametric density estimation. • Parametric Density Estimation — It assumes that the data comes from a known probability distribution whose density can be written as p(x | θ), where θ is the parameter. The maximum likelihood function is a parametric estimation method.
• 31. • Maximum Likelihood Estimation For a sample of observations, one can estimate the probability distribution. This is called density estimation. Maximum Likelihood Estimation (MLE) is a probabilistic framework that can be used for density estimation. • It involves formulating a function called the likelihood function, which is the conditional probability of observing the given samples under the distribution function and its parameters.
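A minimal sketch of MLE for a Gaussian: for a normal distribution the maximum-likelihood estimates are the sample mean and the (1/N) sample variance. The simulated sample below is purely illustrative.

```python
import numpy as np

# Observed sample assumed to come from a normal distribution with unknown parameters.
rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)

# MLE for a Gaussian: the sample mean and the biased (1/N) variance maximize the likelihood.
mu_hat = sample.mean()
sigma2_hat = np.mean((sample - mu_hat) ** 2)

print("MLE mean    :", mu_hat)       # close to 5.0
print("MLE variance:", sigma2_hat)   # close to 4.0
```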
• 34. • Gaussian Mixture Model and Expectation-Maximization (EM) Algorithm In machine learning, clustering is one of the important tasks. It is discussed in Chapter 13. The MLE framework is quite useful for designing model-based methods for clustering data. A model is a statistical method, and the data is assumed to be generated by a distribution model with its parameter θ. • There may be many distributions involved, which is why it is called a mixture model. Since Gaussians are normally assumed for the data, this mixture model is categorized as a Gaussian Mixture Model (GMM). • The EM algorithm is one algorithm that is commonly used for estimating the MLE in the presence of latent or missing variables. What is a latent variable? Let us assume that the dataset includes the weights of boys and girls. Considering that the boys’ weights would be slightly more than those of the girls, one can assume that the larger weights are generated by one Gaussian distribution with one set of parameters while the girls’ weights are generated with another set of parameters. There is an influence of gender in the data, but it is not directly present or observable. Such variables are called latent variables. The EM algorithm is effective for estimating the PDF in the presence of latent variables.
• 35. • Generally, there can be many unspecified distributions with different sets of parameters. The EM algorithm has two stages: • 1. Expectation (E) Stage — In this stage, the expected PDF and its parameters are estimated for each latent variable. • 2. Maximization (M) Stage — In this stage, the parameters are optimized using the MLE function. This process is iterative, and the iteration is continued till all the latent variables are fitted by probability distributions effectively along with their parameters. A library-based GMM sketch follows.
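A sketch of fitting a two-component GMM with scikit-learn's GaussianMixture, which runs the EM algorithm internally; the boys'/girls' weight values are synthetic and only illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical weights: one Gaussian for girls, another for boys (gender is the latent variable).
rng = np.random.default_rng(0)
weights = np.concatenate([rng.normal(55, 5, 300),    # girls' weights (kg), illustrative
                          rng.normal(70, 6, 300)])   # boys' weights (kg), illustrative

# GaussianMixture runs the EM algorithm internally to fit the two components.
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(weights.reshape(-1, 1))

print("Estimated means    :", gmm.means_.ravel())
print("Estimated variances:", gmm.covariances_.ravel())
print("Mixing weights     :", gmm.weights_)
```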
• 36. • Non-parametric Density Estimation A non-parametric estimation can be generative or discriminative. The Parzen window is a generative estimation method that finds p(x | Ɵ) as a conditional density. Discriminative methods directly compute p(Ɵ | x) as a posterior probability. The Parzen window and the k-Nearest Neighbour (KNN) rule are examples of non-parametric density estimation. Let us discuss them now.
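A sketch of a Parzen-window style kernel density estimate using scikit-learn's KernelDensity; the sample and the bandwidth (window width) value are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=500).reshape(-1, 1)

# Gaussian-kernel density estimate; 'bandwidth' plays the role of the window width h.
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(sample)

x = np.array([[0.0], [2.0]])
print("Estimated density p(x):", np.exp(kde.score_samples(x)))  # score_samples returns log-density
```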
• 38. • KNN Estimation The KNN estimation is another non-parametric density estimation method. Here, the parameter k is fixed first and, based on that, the k nearest neighbours are determined. The probability density estimate is then based on the values returned by those neighbours. 2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES • Features are attributes. Feature engineering is about determining the subset of features that form an important part of the input and improves the performance of the model, be it classification or any other model in machine learning. • Feature engineering deals with two problems — feature transformation and feature selection. Feature transformation is the extraction of features and the creation of new features that may be helpful in increasing performance. For example, height and weight may give a new attribute called Body Mass Index (BMI).
• 39. • Feature subset selection is another important aspect of feature engineering that focuses on selecting features to reduce processing time, but not at the cost of reliability. • Subset selection reduces the dataset size by removing irrelevant features and constructs a minimum set of attributes for machine learning. If the dataset has n attributes, the time complexity is extremely high, as n dimensions need to be processed for the given dataset. • For n attributes, there are 2^n possible subsets. If the value of n is high, the problem becomes intractable. This is called the ‘curse of dimensionality’: as the number of dimensions increases, the time complexity increases. The remedy is to delete some of the components that do not contribute much, which results in a reduction of dimensionality. • Choosing optimal attributes becomes a graph search problem. Typically, the feature subset selection problem uses a greedy approach, making the locally optimal choice at each step while hoping that it leads to a globally optimal solution.
• 40. • Features can be removed based on two aspects: • 1. Feature relevancy — Some features contribute more to classification than other features. For example, a mole on the face can help in face detection more than common features like a nose. In simple words, the features should be relevant. The relevancy of the features can be determined based on information measures such as mutual information, correlation-based measures like the correlation coefficient, and distance measures. Distance measures are discussed in Chapter 13 of this book. • 2. Feature redundancy — Some features are redundant. For example, when a database table has a field called date of birth, the age field is redundant, as age can be computed easily from the date of birth. Removing the age column leads to a reduction of dimensionality by one.
• 41. • So, the procedure is: • 1. Generate all possible subsets • 2. Evaluate the subsets and model performance • 3. Evaluate the results for optimal feature selection • Filter-based selection uses statistical measures for assessing features. In this approach, no learning algorithm is used. Correlation and information-gain measures like mutual information and entropy are all examples of this approach (a sketch follows). • Wrapper-based methods use classifiers to identify the best features. The features are selected and evaluated by the learning algorithms. This procedure is computationally intensive but has superior performance.
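A filter-based selection sketch using scikit-learn's SelectKBest with mutual information as the scoring measure; the Iris dataset and k = 2 are illustrative choices, not the book's example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Filter approach: rank features by mutual information with the class label and keep the
# best two; no learning algorithm is involved in the scoring itself.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print("Scores per feature:", selector.scores_)
print("Reduced shape     :", X_reduced.shape)   # (150, 2)
```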
• 42. • Stepwise Forward Selection • This procedure starts with an empty set of attributes. Each time, an attribute is tested for statistical significance and the best one is added to the reduced set. This process is continued till a good reduced set of attributes is obtained. • 2.10.2 Stepwise Backward Elimination • This procedure starts with the complete set of attributes. At every stage, the procedure removes the worst attribute from the set, leading to the reduced set. • Combined Approach Both forward and backward methods can be combined so that the procedure adds the best attribute and removes the worst attribute at each step. A wrapper-based sketch of both directions follows.
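A wrapper-based sketch of stepwise forward selection and backward elimination using scikit-learn's SequentialFeatureSelector; the classifier, dataset and number of features kept are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Stepwise forward selection: start empty, greedily add the best attribute.
forward = SequentialFeatureSelector(clf, n_features_to_select=2, direction="forward").fit(X, y)
# Stepwise backward elimination: start full, greedily remove the worst attribute.
backward = SequentialFeatureSelector(clf, n_features_to_select=2, direction="backward").fit(X, y)

print("Forward  keeps features:", forward.get_support())
print("Backward keeps features:", backward.get_support())
```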
  • 43. 2.10.3 Principal Component Analysis • The idea of the principal component analysis (PCA) or KL transform is to transform a given set of measurements to a new set of features so that the features exhibit high information packing properties. This leads to a reduced and compact set of features. Basically, this elimination is made possible because of the information redundancies. This compact representation is of a reduced dimension.
• 46. • The advantages of PCA are immense. It reduces the attribute list by eliminating all irrelevant attributes. The PCA algorithm is as follows: • 1. The target dataset x is obtained. • 2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is x − m. The objective of this step is to transform the dataset to have zero mean. • 3. The covariance of the dataset x is obtained. Let it be C. • 4. The eigen values and eigen vectors of the covariance matrix are calculated.
• 47. • 5. The eigen vector with the highest eigen value is the principal component of the dataset. The eigen values are arranged in descending order. The feature vector is formed with these eigen vectors as its columns: Feature vector = {eigen vector 1, eigen vector 2, …, eigen vector k}. • 6. Obtain the transpose of the feature vector. Let it be A. • 7. The PCA transform is y = A × (x − m), where x is the input dataset, m is the mean, and A is the transpose of the feature vector. A NumPy sketch of these steps follows.
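A NumPy sketch that follows the PCA steps listed above (mean subtraction, covariance, eigen decomposition, projection); the random dataset and the number of components are illustrative.

```python
import numpy as np

def pca(x, num_components):
    """PCA following the steps above: centre, covariance, eigen decomposition, project."""
    m = x.mean(axis=0)                      # step 2: mean
    x_adjusted = x - m                      # zero-mean dataset
    C = np.cov(x_adjusted, rowvar=False)    # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # step 4: eigen values/vectors (C is symmetric)
    order = np.argsort(eigvals)[::-1]       # step 5: arrange eigen values in descending order
    feature_vector = eigvecs[:, order[:num_components]]
    A = feature_vector.T                    # step 6: transpose of the feature vector
    y = (A @ x_adjusted.T).T                # step 7: y = A (x - m)
    return y, m, A

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
projected, mean, A = pca(data, num_components=2)
print(projected.shape)                      # (100, 2)
```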
• 52. • The process of acquiring knowledge and expertise through study, experience, or being taught is called learning. Generally, humans learn in different ways. To make machines learn, we need to simulate the strategies of human learning in machines. But will computers learn? This question has been raised over many centuries by philosophers, mathematicians and logicians. First, let us address the question - what sort of tasks can computers learn? This depends on the nature of the problems that computers can solve. There are two kinds of problems — well-posed and ill-posed. Computers can solve only well-posed problems, as these have well-defined specifications and have the following components inherent to them: • A class of learning tasks (T) • A measure of performance (P) • A source of experience (E)
• 53. • The standard definition of learning proposed by Tom Mitchell is that a program is said to learn from experience E with respect to task T and performance measure P, if its performance on T, as measured by P, improves with experience E. • Let us formalize the concept of learning as follows: Let x be the input and X be the input space, which is the set of all inputs; Y is the output space, which is the set of all possible outputs, that is, yes/no. • Let D be the input dataset with examples (x1, y1), (x2, y2), …, (xn, yn) for n inputs. Let the unknown target function be f: X → Y, which maps the input space to the output space. The objective of the learning program is to pick a function g: X → Y that approximates the target function f. • All the possible formulae form a hypothesis space. In short, let H be the set of all hypotheses from which the learning algorithm chooses. The choice is good when the hypothesis g replicates f for all samples. This is shown in Figure 3.1.
• 56. Classical and Adaptive Machine Learning Systems • A classical machine learning system has components such as Input, Process and Output. The input values are taken from the environment directly. These values are processed and a hypothesis is generated as the output model. This model is then used for making predictions. The predicted values are consumed by the environment. • In contrast to classical systems, adaptive systems interact with the environment to obtain labelled data, as direct labelled inputs are not available. This process is called reinforcement learning. • In reinforcement learning, a learning agent interacts with the environment and in return gets feedback. Based on the feedback, the learning agent generates input samples for learning, which are used for generating the learning model. Such learning agents are not static and change their behaviour according to the external signal received from the environment.
• 57. Learning Types • There are different types of learning. Some of the different learning methods are as follows: • Learning by memorization or learning by repetition, also called rote learning, is done by memorizing without understanding the logic or concept. Although rote learning is basically learning by repetition, from a machine learning perspective, the learning occurs by simply comparing with the existing knowledge for the same input data and producing the output if present. • Learning by examples, also called learning by experience or from previous knowledge acquired at some time, is like finding an analogy; it means performing inductive learning from observations that formulate a general concept. Here, the learner learns by inferring a general rule from the set of observations or examples. Therefore, inductive learning is also called discovery learning.
• 58. • Learning by being taught by an expert or a teacher is generally called passive learning. However, there is a special kind of learning called active learning, where the learner can interactively query a teacher/expert to label unlabelled data instances with the desired outputs. • Learning by critical thinking, also called deductive learning, deduces new facts or conclusions from related known facts and information. • Self-learning, also called reinforcement learning, is a self-directed learning that normally learns from mistakes through punishments and rewards. • Learning to solve problems is a type of cognitive learning where learning happens in the mind and is possible by devising a methodology to achieve a goal. Here, the learner initially is not aware of the solution or the way to achieve the goal but only knows the goal. The learning happens either directly from the initial state by following the steps to achieve the goal or indirectly by inferring the behaviour. • Learning by generalizing explanations, also called explanation-based learning (EBL), is another learning method that exploits domain knowledge from experts to improve the accuracy of concepts learned by supervised learning.
• 59. DESIGN OF A LEARNING SYSTEM • A system that is built around a learning algorithm is called a learning system. • The design of such systems focuses on these steps: • Choosing a training experience • Choosing a target function • Representation of the target function • Function approximation Training Experience • Let us consider designing a chess game. With direct experience, individual board states and the correct moves of the chess game are given directly. With indirect experience, only the move sequences and results are given.
• 60. • The training experience also depends on the presence of a supervisor who can label all valid moves for a board state. In the absence of a supervisor, the game agent plays against itself and learns the good moves, provided the training samples cover all scenarios, or in other words, are distributed well enough for performance computation. If the training samples and testing samples have the same distribution, the results will be good. Determine the Target Function • The next step is the determination of a target function. In this step, the type of knowledge that needs to be learnt is determined. With direct experience, a board move is selected and it is determined whether it is a good move or not against all other moves. If it is the best move, then it is chosen, giving a mapping B → M, where B is a board state and M is a legal move. • With indirect experience, all legal moves are accepted and a score is generated for each. The move with the largest score is then chosen and executed.
• 63. • INTRODUCTION TO CONCEPT LEARNING • Concept learning is a learning strategy of acquiring abstract knowledge, inferring a general concept, or deriving a category from the given training samples. It is a process of abstraction and generalization from the data. Concept learning helps to classify an object that has a set of common, relevant features. • The learner tries to simplify by observing the common features from the training samples and then applies this simplified model to the future samples. This task is also known as learning from experience. Each concept or category obtained by learning is a Boolean-valued function which takes a true or false value. For example, humans can identify different kinds of animals based on common relevant features and categorize all animals based on specific sets of features.
• 64. • The special features that distinguish one animal from another can be called a concept. This way of learning categories for objects and recognizing new instances of those categories is called concept learning. It is formally defined as inferring a Boolean-valued function by processing training instances. Concept learning requires three things: • Input — A training dataset, which is a set of training instances, each labelled with the name of the concept or category to which it belongs. This past experience is used to train and build the model. • Output — The target concept or target function f. It is a mapping function f(x) from input x to output y. It determines the specific or common features needed to identify an object. In other words, it finds the hypothesis that determines the target concept. For example, the specific set of features to identify an elephant from all animals. • Test — New instances to test the learned model.
• 65. • Formally, concept learning is defined as: "Given a set of hypotheses, the learner searches through the hypothesis space to identify the best hypothesis that matches the target concept". Consider the following set of training instances shown in Table 3.1. • Here, in this set of training instances, the independent attributes considered are ‘Horns’, ‘Tail’, ‘Tusks’, ‘Paws’, ‘Fur’, ‘Color’, ‘Hooves’ and ‘Size’. The dependent attribute is ‘Elephant’. • The target concept is to identify the animal as an Elephant. Let us now take this example and understand further the concept of a hypothesis. Target Concept: Predict the type of animal, for example, Elephant.
  • 69. Hypothesis Space • Hypothesis space is the set of all possible hypotheses that approximates the target function f. • In other words, the set of all possible approximations of the target function can be defined as hypothesis space. From this set of hypotheses in the hypothesis space, a machine learning algorithm would determine the best possible hypothesis that would best describe the target function or best fit the outputs. • Generally, a hypothesis representation language represents a larger hypothesis space. Every machine learning algorithm would represent the hypothesis space in a different manner about the function that maps the input variables to output variables.
• 70. • For example, a regression algorithm represents the hypothesis space as a linear function, whereas a decision tree algorithm represents the hypothesis space as a tree. • The set of hypotheses that can be generated by a learning algorithm can be further reduced by specifying a language bias. • The subset of the hypothesis space that is consistent with all observed training instances is called the Version Space. The version space represents the only hypotheses that are used for classification. • For example, each of the attributes given in Table 3.1 has the following possible set of values.
• 71. • Considering these values for each of the attributes, there are (2×2×2×2×2×3×2×2) = 384 distinct instances covering all the 5 instances in the training dataset. • So, we can generate (4×4×4×4×4×5×4×4) = 81,920 distinct hypotheses when including two more values [?, Φ] for each attribute. However, any hypothesis containing one or more
• 72. • Φ symbols represents the empty set of instances; that is, it classifies every instance as a negative instance. • Therefore, there will be (3×3×3×3×3×4×3×3 + 1) = 8,749 distinct hypotheses by including only ‘?’ for each attribute plus one hypothesis representing the empty set of instances. Thus, the hypothesis space is much larger, and hence we need efficient learning algorithms to search for the best hypothesis from the set of hypotheses. • Hypothesis ordering is also important, wherein the hypotheses are ordered from the most specific to the most general in order to avoid searching the hypothesis space exhaustively.
• 73. Heuristic Space Search • Heuristic search is a search strategy that finds an optimized hypothesis/solution to a problem by iteratively improving the hypothesis/solution based on a given heuristic function or a cost measure. • Heuristic search methods will generate a possible hypothesis that can be a solution in the hypothesis space or a path from the initial state. This hypothesis is then tested against the target function or goal state to see whether it is the real solution. • Several commonly used heuristic search methods are hill climbing, constraint satisfaction, best-first search, simulated annealing, the A* algorithm, and genetic algorithms.
• 74. Generalization and Specialization • In order to understand how we construct this concept hierarchy, let us apply the general principle of the generalization/specialization relation. By generalization of the most specific hypothesis and by specialization of the most general hypothesis, the hypothesis space can be searched for an approximate hypothesis that matches all positive instances but does not match any negative instance. Searching the Hypothesis Space • There are two ways of learning a hypothesis consistent with all training instances from the large hypothesis space: • 1. Specialization — General to Specific learning • 2. Generalization — Specific to General learning
  • 75. • Generalization – Specific to General Learning This learning methodology will search through the hypothesis space for an approximate hypothesis by generalizing the most specific hypothesis.
  • 77. • Specialization - General to Specific Learning This learning methodology will search through the hypothesis space for an approximate hypothesis by specializing the most general hypothesis.
• 79. Hypothesis Space Search by Find-S Algorithm • The Find-S algorithm is guaranteed to converge to the most specific hypothesis in H that is consistent with the positive instances in the training dataset. Provided the training data is consistent, it will also be consistent with the negative instances. • Thus, this algorithm considers only the positive instances and ignores the negative instances while generating the hypothesis. It initially starts with the most specific hypothesis. A Python sketch of the algorithm follows.
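A minimal Python sketch of the Find-S algorithm; the training instances below are hypothetical and only mimic the spirit of Table 3.1, whose actual values are not reproduced here.

```python
def find_s(instances, labels):
    """Find-S: start with the most specific hypothesis and generalize on positive instances."""
    n = len(instances[0])
    hypothesis = ['Φ'] * n                      # most specific hypothesis
    for instance, label in zip(instances, labels):
        if label != 'Yes':                      # negative instances are ignored
            continue
        for i, value in enumerate(instance):
            if hypothesis[i] == 'Φ':
                hypothesis[i] = value           # first positive instance copies its values
            elif hypothesis[i] != value:
                hypothesis[i] = '?'             # conflicting values generalize to '?'
    return hypothesis

# Hypothetical training instances (attribute values are made up, not taken from Table 3.1).
instances = [['No', 'Short', 'Yes', 'No', 'No', 'Black', 'No', 'Big'],
             ['No', 'Short', 'Yes', 'No', 'No', 'Brown', 'No', 'Big'],
             ['Yes', 'Short', 'No', 'No', 'No', 'Black', 'Yes', 'Medium']]
labels = ['Yes', 'Yes', 'No']
print(find_s(instances, labels))   # ['No', 'Short', 'Yes', 'No', 'No', '?', 'No', 'Big']
```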
• 82. Limitations of Find-S Algorithm • 1. The Find-S algorithm tries to find a hypothesis that is consistent with the positive instances, ignoring all negative instances. Only as long as the training dataset is consistent will the hypothesis found by this algorithm be consistent. • 2. The algorithm finds only one unique hypothesis, whereas there may be many other hypotheses that are consistent with the training dataset. • 3. Many times, the training dataset may contain errors; such inconsistent data may mislead this algorithm in determining a consistent hypothesis.
• 83. Version Spaces • The version space contains the subset of hypotheses from the hypothesis space that is consistent with all training instances in the training dataset. List-Then-Eliminate Algorithm • The principal idea of this learning algorithm is to initialize the version space to contain all hypotheses and then eliminate any hypothesis that is found inconsistent with any training instance. Initially, the algorithm starts with a version space containing all hypotheses and scans each training instance. • The hypotheses that are inconsistent with a training instance are eliminated. Finally, the algorithm outputs the list of remaining hypotheses, all of which are consistent.
• 84. Version Spaces and the Candidate Elimination Algorithm • Version space learning generates all consistent hypotheses. This algorithm computes the version space by combining the two cases, namely: • * Specific to General learning — Generalize S to include the positive example • * General to Specific learning — Specialize G to exclude the negative example • Using the Candidate Elimination algorithm, we can compute the version space containing all (and only those) hypotheses from H that are consistent with the given observed sequence of training instances.
• 85. • The algorithm defines two boundaries: the ‘general boundary’, which is the set of all hypotheses that are the most general, and the ‘specific boundary’, which is the set of all hypotheses that are the most specific. Thus, the algorithm limits the version space to contain only those hypotheses bounded by the most general and the most specific ones, and it therefore provides a more compact representation than the List-Then-Eliminate algorithm.
• 90. INDUCTION BIASES • Induction is the process of learning a target function or generalizing training data into a general model. • Inductive bias is the set of prior assumptions considered by a learning algorithm, beyond the training data, in order to perform induction. It is also called the bias of the algorithm. • It is similar to prior knowledge used for learning new concepts. For example, linear regression assumes that the predictors (independent variables) are linearly related to the target variable. Similarly, the C4.5 decision tree learning algorithm always greedily chooses the best attribute as the split criterion when constructing the decision tree. There are two types of bias: – 1. Constraint or Restriction — limits the hypothesis space – 2. Preference — imposes an ordering on the hypothesis space
• 91. • Bias and Variance • In supervised machine learning models, we need to find the target function f(x) that best maps an input value ‘x’ to the actual output value ‘y’. If the predicted output value deviates from the actual output value, we call it an error. There are three kinds of prediction errors — bias error, variance error and irreducible error. • 1. Irreducible errors cannot be reduced or avoided; they normally occur because of factors like unknown variables, noise, etc. Their impact can sometimes be lessened by data cleaning. • 2. A bias error is the difference between the predicted output value and the actual output value of a learning model. Bias reflects the accuracy of the model's predictions, often measured as the Mean Square Error (MSE) of the predictions. This bias error occurs due to inaccurate and over-simplifying assumptions, or under-fitting, during learning with the training data. The error becomes prominent when test data is provided to the learned model.
• 92. • 3. A variance error is the change in the estimated target function caused by small fluctuations, outliers, or noise in the training dataset. This change in the estimated target function also occurs when the number and types of parameters used to derive the mapping function change across different training sets. In other words, variance is the variation or spread from the actual value that can be seen among the predictions of many models estimated from different training sets. A pictorial representation of bias and variance is shown in Figure 3.3.
  • 93. Bias vs Variance Tradeoff
• 94. Best Fit in Machine Learning • Generalizing the hypothesis from training instances to a model is called inductive learning. • Generalization basically describes the model's ability to infer or predict correctly on new, unseen data after being trained with a training dataset. If a model is over-trained or under-trained, then its predictions are not going to be accurate. • Generally, the performance of a machine learning model, or its predictions, deteriorates when it learns too much or too little from the training instances. When a machine learning model learns too much from the training instances, including noise, overfitting occurs. In other words, the model tries to fit the training data instances too well, or the predictors/features/independent variables are too complex. • Moreover, when a model performs very well on the training data but poorly on the test data, it is also an overfitting problem.
• 95. • Overfitting occurs more often with non-parametric and non-linear models that have more flexibility when learning a target function. • Underfitting generally occurs when a machine learning model cannot learn from the training instances, when the instances do not match the model's assumptions, or when the predictors are very simple. In other words, the model does not fit the data well enough. • Ideally, the goal of selecting a machine learning model is to provide prediction performance between underfitting and overfitting. Hence, it is essential to find a model that provides a good fit, but practically this is very difficult to achieve.
• 96. • Model Selection and Model Evaluation • The biggest challenge in machine learning is choosing an algorithm that suits the problem. Hence, model selection and assessment are very important and deal with two types of complexities: • 1. Model Performance — How well does the model perform on the training dataset? • 2. Model Complexity — How much complexity does the model possess after the training phase is over? • Model selection is the process of selecting one good-enough model among different machine learning models for the dataset, or of selecting different sets of features or hyperparameters for the same machine learning model. It is difficult to find the best model because all models exhibit some predictive error for the problem, so at least a good-enough model should be selected that performs fairly well with the dataset.
• 97. • Some of the approaches used for selecting a machine learning model are listed below: • 1. Use resampling methods: split the dataset into training, testing and validation datasets and observe the performance of the model over all the phases. This approach is suitable for smaller datasets. • 2. The simplest approach is to fit a model on the training dataset and compute measures like error or accuracy. • 3. The use of a probabilistic framework and quantification of the performance of the model as a score is the third approach. • These methods are discussed in the following sections; a resampling-based sketch follows.
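A resampling-based sketch of model selection with scikit-learn, illustrating approaches 1 and 2: candidate models are compared by cross-validation on a training split and then checked on a held-out test split. The dataset and the two candidate models are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare candidate models by resampling (5-fold cross-validation) on the training split,
# then confirm the chosen model on the held-out test split.
for model in (DecisionTreeClassifier(random_state=0), LogisticRegression(max_iter=1000)):
    cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
    test_score = model.fit(X_train, y_train).score(X_test, y_test)
    print(type(model).__name__, "CV accuracy:", round(cv_score, 3),
          "test accuracy:", round(test_score, 3))
```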