Linear Regression
Data Mining problems
• Data mining problems are often divided into Predictive tasks and
Descriptive tasks.
• Predictive Analytics (Supervised learning):
Given observed data (x1,y1), (x2,y2),..., (xn, yn) learn a model to predict Y
from X.
 If Yi is a continuous numeric value, this task is called prediction (e.g., Yi = stock price, income, survival time)
 If Yi is a discrete or symbolic value, this task is called classification (e.g., Yi ∈ {0, 1}, Yi ∈ {spam, email}, Yi ∈ {1, 2, 3, 4})
• Descriptive Analytics (Unsupervised learning):
Given data x1,x2,..xn, identify some underlying patterns or structure in the
data.
Regression in data mining
• Predict a real-valued output for a given input, given a training set
– Examples:
• Predict rainfall in cm for a month
• Predict the stock price for the next day
• Predict the number of users who will click on an internet
advertisement
• Classification problem:
– A set of predefined categories/classes
– Training examples with attribute as well as class information available (supervised learning)
– Classification task: predict the class label for a new example (predictive mining)
• Clustering task:
– No predefined classes
– Attempt to find homogeneous groups in the data (exploratory data mining)
– Training examples have attribute values only
– No class information is available
• Regression:
– It is predictive data mining
– Given the attribute values of an example, you have to predict the output
– The output is not a class
– The output is a real value
– Supervised learning
Linear Regression
• Linear regression aims to predict the response Y by estimating the best
linear predictor: the linear function that is closest to the true regression
function f.
• Task: predict a real-valued Y, given a real-valued vector x, using a regression model f.
• An error function, e.g., least squares, is often used.
• Why is this usually called linear regression?
– The model is linear in the parameters (see the sketch below)
• Goal: the function f applied to the training data should produce values as close as possible, in aggregate, to the actual outputs
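As an aside, "linear in the parameters" does not mean linear in the raw input. The following is a minimal sketch (made-up data, NumPy assumed) showing that a model with a squared feature, y = a0 + a1x + a2x², is still linear regression, because it is linear in a0, a1, a2:

```python
import numpy as np

# Hypothetical data with a curved relationship between x and y
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0])

# Design matrix with columns [1, x, x^2]: non-linear features of x,
# but the model a0 + a1*x + a2*x^2 is linear in the parameters,
# so ordinary least squares still applies.
X = np.column_stack([np.ones_like(x), x, x**2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # approximately [a0, a1, a2]
```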
• For example:
– xi = temperature today
– yi = rainfall volume tomorrow
• Another example:
– xi = temperature today
– yi = traffic density
• The training set consists of pairs (x1, y1), (x2, y2), ..., (xn, yn), and the regression task is to predict the value yn+1 for a new input xn+1.
• When x is a single value, this is called univariate regression.
• Multivariate regression:
– Training set: each example has several inputs and one output (see the notation below).
– There is a single output y but multiple inputs x1, x2, ...
Example: predict the temperature of a place based on humidity and pressure.
– There can also be multiple outputs.
• Regression model:
Y = f(x1, x2, ..., xn) → multivariate
Y = f(x) → univariate
Y: output (dependent) variable
x1, x2, ..., xn: input (independent) variables
f: regression function or model
Training set with two inputs: (x11, x12, y1), (x21, x22, y2), ..., (xn1, xn2, yn)
• The model f determines how the dependent variable y
depends on the independent variable x.
• Linear regression: f is a linear function.
In general, for linear regression:
y = f(x1, x2, ..., xn) = a0 + a1x1 + a2x2 + ... + anxn
where a0, a1, a2, ..., an are the regression coefficients.
• Univariate case: a line, y = a0 + a1x1
• Multivariate case: a plane, y = a0 + a1x1 + a2x2 + ... + anxn
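A tiny illustration (the coefficient values here are made up) of how the univariate and multivariate models above are evaluated; the multivariate case is just a dot product between the coefficients and the inputs:

```python
import numpy as np

a0, a1 = 2.0, 0.5                     # hypothetical univariate coefficients
x = 10.0
y_uni = a0 + a1 * x                   # y = a0 + a1*x

a = np.array([2.0, 0.5, -1.2, 0.3])   # hypothetical [a0, a1, a2, a3]
x_vec = np.array([10.0, 3.0, 7.0])    # inputs x1, x2, x3
y_multi = a[0] + a[1:] @ x_vec        # y = a0 + a1*x1 + a2*x2 + a3*x3
print(y_uni, y_multi)
```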
• Given: (x1, y1), (x2, y2), ..., (xn, yn)
Find a0, a1 such that y = a0 + a1x, so that the line best fits the given data.
a1, a2, ... are the slopes of the regression and a0 is the bias or axis intercept.
• Training a regression model:
Given a training set (x11, x12, ..., x1k, y1), (x21, x22, ..., x2k, y2), ..., (xn1, xn2, ..., xnk, yn):
– Find the values of the regression coefficients in
y = a0 + a1x1 + a2x2 + ... + anxn
that best match/fit the training data.
– Univariate regression: find values of a0, a1 such that the line y = a0 + a1x best fits the data (a minimal fitting sketch follows).
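A minimal univariate training sketch on made-up (x, y) pairs; np.polyfit returns the least-squares coefficients with the highest degree first, so the slope a1 comes before the intercept a0:

```python
import numpy as np

# Hypothetical training pairs (x1, y1), ..., (xn, yn)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of a straight line: returns [a1, a0]
a1, a0 = np.polyfit(x, y, deg=1)
print(a0, a1)            # intercept and slope of the best-fit line
print(a0 + a1 * 6.0)     # predicted y for a new input x = 6.0
```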
Least squares error
• To find the line having the least error, define an error function of a line.
• Define the error function
SSE = Σi=1..n ei²
where ei is the difference between the actual value of yi and the model-predicted value of yi.
• For a given value xi, the actual value is yi and the predicted value is a0 + a1xi.
• So, for the univariate case, the error is
ei = yi - (a0 + a1xi)
S = Σi=1..n ei² = Σi=1..n (yi - a0 - a1xi)²
• The square is taken in the error function so that positive and negative errors are given equal importance: both are equally bad.
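A short sketch of the error function, reusing the made-up data from the previous sketch: compute the residuals ei = yi - (a0 + a1xi) for a candidate line and sum their squares; the least-squares line gives the smallest value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(a0, a1, x, y):
    """Sum of squared errors of the line y = a0 + a1*x on the data."""
    e = y - (a0 + a1 * x)      # residuals e_i
    return np.sum(e ** 2)

print(sse(0.0, 2.0, x, y))     # an arbitrary candidate line
a1, a0 = np.polyfit(x, y, deg=1)
print(sse(a0, a1, x, y))       # the least-squares line: smallest SSE
```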
• For the multivariate case,
S = Σi=1..n ei² = Σi=1..n (yi - (a0 + a1xi1 + a2xi2 + ... + akxik))²
• Find values of the regression coefficients a0, a1, ... such that the sum of squared errors is minimised.
• Predictions based on this equation are the best predictions possible, in the sense that they will be unbiased (equal to the true values on average) and will have the smallest expected squared error compared to any unbiased estimates, under the following assumptions:
– Linearity of the relationship between dependent and independent variables
– Statistical independence of the errors
– Homoskedasticity (constant variance) of the errors
– Normality of the error distribution
Linear Regression
Model structure: y = a0 + a1x1 + a2x2 + ... + akxk = f(X : θ)
Model parameters: θ = [a0, a1, a2, ..., ak]
Error function: S(θ) = Σi ei², where ei = yi - f(Xi ; θ)
With this compact notation, the linear regression model can be written in the form
y = Xa + e
where y = [y1, y2, ..., yn]' is the vector of outputs, X is the n × (p+1) matrix whose i-th row is [1, xi1, ..., xip], a = [a0, a1, ..., ap]' is the vector of coefficients, and e = [e1, e2, ..., en]' is the vector of errors.
Linear regression estimates the parameters by finding the parameter values that minimize the residual sum of squares
S(a) = Σi=1..n ei² = e'e = (y - Xa)'(y - Xa)
By solving, we get the final estimates
â = (X'X)⁻¹X'y
The solutions of this equation are called the direct regression estimators, or usually the ordinary least squares (OLS) estimators.
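A minimal NumPy sketch (made-up data) of the OLS estimate â = (X'X)⁻¹X'y; forming the explicit inverse mirrors the formula, but in practice np.linalg.lstsq or a pseudo-inverse is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n samples, 2 predictors, known true coefficients
n = 100
X_raw = rng.normal(size=(n, 2))
true_a = np.array([1.0, 2.0, -3.0])           # [a0, a1, a2]
y = true_a[0] + X_raw @ true_a[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept a0
X = np.column_stack([np.ones(n), X_raw])

# Direct (normal-equation) OLS estimate: a_hat = (X'X)^(-1) X'y
a_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(a_hat)                                   # close to [1.0, 2.0, -3.0]

# Numerically safer equivalent
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a_lstsq)
```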
• S(θ) is defined for the training data:
S(θ) = Σ (actual value - model-predicted value)²
• We are really interested in finding the model that best predicts y on future data, i.e., minimising the sum of squared errors where the expectation is over future data.
• This is known as empirical learning, which is based on observed data. We are interested not only in minimising S(θ) on the training data but also in getting the best prediction on unknown future data.
• The usual assumption is that future data will behave the way past data behaved.
– If we have a model which minimises error on past data, it will also minimise the error on future data.
– If the training data is large and the model is simple, we are assuming that the best f on the training data is also the best predictor f on future test data.
Limitations of linear regression
• The true relationship of X and Y might be non-linear
– suggests generalisations to non-linear models
• Complexity:
– the cost of computation and the time complexity increase with the number of attributes
• Correlation/collinearity among the X variables
– can cause numerical instability (the inverse does not exist if the matrix is not full rank; see the sketch below)
– problems in interpretability (identifiability: determining whether the model's true parameters can be recovered from the observed data)
• The model includes all the variables.
– But what if there are 1000 attributes and only 3 variables are actually related to Y?
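A small sketch (made-up data) of the collinearity problem mentioned above: when one column of X is a linear combination of the others, X is not full rank, X'X is singular or ill-conditioned, and the normal-equation solve breaks down or becomes unstable:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1                        # exactly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

print(np.linalg.matrix_rank(X))      # 2, not 3: X is not full rank
print(np.linalg.cond(X.T @ X))       # enormous condition number (numerically singular)
# np.linalg.inv(X.T @ X) would raise LinAlgError or return garbage here.
```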
Complexity vs. goodness of fit
• Suppose the regression model is linear and too simple
– A simple model that does not fit the data well has a large training-set error
– A biased solution
• Suppose the model is made more complex (a non-linear regression model) to fit the training data itself
– A complex model has a low training-set error but a high error on future points: it causes overfitting
– With small changes to the data, the solution changes a lot
– A high-variance solution
• Occam's Razor principle (Principle of Parsimony):
– The principle states that "Entities should not be multiplied unnecessarily."
– "When you have two competing theories that make exactly the same predictions, the simpler one is the better."
– Use the simplest model which gives acceptable accuracy on the training set; do not complicate the model to overfit the training data
• Choose a model which sacrifices some training-set error for better performance on future samples.
• Penalize complex models (a small scoring sketch follows) based on
– Prior information (bias)
– Information criteria (MDL, AIC, BIC)
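One common way to penalize complexity, sketched here under the usual Gaussian-error assumption (and up to an additive constant), is to score each candidate model with AIC or BIC: both combine the fit term n·ln(RSS/n) with a penalty that grows with the number of parameters k:

```python
import numpy as np

def aic_bic(y, y_pred, k):
    """AIC and BIC for a least-squares fit with k parameters (Gaussian errors)."""
    n = len(y)
    rss = np.sum((y - y_pred) ** 2)            # residual sum of squares
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

# Hypothetical comparison: a straight line vs. a degree-5 polynomial
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=30)
for deg in (1, 5):
    coeffs = np.polyfit(x, y, deg)
    print(deg, aic_bic(y, np.polyval(coeffs, x), k=deg + 1))  # lower score is better
```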
Bias and variance for regression
• For regression, we can easily decompose the error of the learned model into two
parts: bias (error 1) and variance (error 2)
• Bias:
– The difference between the average prediction of our model and the correct
value which we are trying to predict.
– How much does the mean of the predictor differ from the optimal predictor
• Variance:
– The variability of the model prediction for a given data point; it tells us the spread of the predictions across training sets.
– How much does the predictor vary about its mean for different training datasets? (A small simulation sketch follows.)
– The variance of a learning algorithm is a measure of its precision. A high variance error implies that the model is highly sensitive to small fluctuations in the training data.
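A rough simulation sketch of the decomposition (an assumed true function and made-up noise): draw many training sets from the same process, fit a simple linear model on each, and inspect, at one fixed query point, how far the average prediction is from the truth (bias) and how much the predictions spread around their own mean (variance):

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    return np.sin(2 * np.pi * x)              # assumed true regression function

x_test = 0.3                                  # fixed query point
preds = []
for _ in range(200):                          # 200 independent training sets
    x_tr = rng.uniform(0.0, 1.0, size=20)
    y_tr = true_f(x_tr) + rng.normal(scale=0.2, size=20)
    a1, a0 = np.polyfit(x_tr, y_tr, deg=1)    # simple (high-bias) linear model
    preds.append(a0 + a1 * x_test)

preds = np.array(preds)
bias = preds.mean() - true_f(x_test)          # average prediction vs. the truth
variance = preds.var()                        # spread of predictions across training sets
print(bias, variance)
```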
Linear Regression for Data Mining Application
Training and Test Error
• Given a dataset, the training data is used to fit the parameters of the model. For training we choose a loss function, e.g., squared error for regression.
• The training error is the mean error over the training sample.
• The test (or generalization) error is the expected prediction error over an independent test sample.
• The prediction error or true (generalization) error (over the whole population) is the target performance measure, i.e., performance on a random test point (X, Y).
• Training error is not a good estimator of test error (see the sketch below).
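A minimal sketch of the gap between the two quantities (assumed data-generating process): fit on a training sample, then measure mean squared error on the training sample (training error) and on a fresh, independent sample (an estimate of test error). A very flexible model typically shows a much larger gap:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample(n):
    """Draw n points from an assumed data-generating process."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = sample(20)
x_test, y_test = sample(1000)        # independent test sample

for deg in (1, 9):                   # a simple vs. a very flexible polynomial
    coeffs = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(deg, train_mse, test_mse)  # degree 9: low training error, much higher test error
```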
Model Complexity and Generalization
• A model's ability to adapt to patterns in the data is what we call model complexity.
• A model with greater complexity might be theoretically more accurate (i.e., low bias).
– But you have less control over what it might predict on a tiny training data set.
– Different training data sets will result in widely varying predictions for the same test instance.
• Generalization ability: we want good predictions on new data, i.e., 'generalization'. What is the out-of-sample error of the learner f?
• Training error can be reduced by making the hypothesis more sensitive to the training data, but this may lead to overfitting and poor generalization.
Model Selection and Assessment
• When we want to estimate test error, we may have two different goals in
mind:
1. Model selection: Estimate the performance of different hypotheses or
algorithms in order to choose the (approximately) best one.
2. Model assessment: Having chosen a final hypothesis or algorithm,
estimate its generalization error on new data.
• Trade-off between bias and variance:
– Simple Models: High Bias, Low Variance
– Complex Models: Low Bias, High Variance
• Thus, a designer is virtually always confronted with the following dilemma:
– On one hand, if the model is too simple, it will give a poor
approximation of the phenomenon (underfitting).
– On the other hand, if the model is too complex, it will be able to
fit exactly the examples available, without finding a consistent
way of modelling (overfitting).
• Choice of models balances bias and variance.
– Over-fitting → Variance is too High
– Under-fitting → Bias is too High
Training, Validation and Test Data
• In a data-rich situation, the best approach to both model selection and
model assessment is to randomly divide the dataset into three parts:
1. A training set used to fit the models.
2. A validation set (or development test set) used to estimate test error for
model selection.
3. A test set (or evaluation test set) used for assessment of the
generalization error of the finally chosen model.
• Training: train different models
• Validation: evaluate different models
• Test: evaluate the accuracy of the final model
The trained model can then be used to make predictions on
unseen observations
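A sketch of the three-way protocol on made-up data: fit the candidate models on the training set, pick the one with the lowest validation error (model selection), and only then report its error on the untouched test set (model assessment):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=300)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=300)

# Random three-way split: 60% training, 20% validation, 20% test
idx = rng.permutation(300)
tr, va, te = idx[:180], idx[180:240], idx[240:]

def mse(coeffs, i):
    return np.mean((y[i] - np.polyval(coeffs, x[i])) ** 2)

# Model selection: choose the polynomial degree with the lowest validation error
fits = {deg: np.polyfit(x[tr], y[tr], deg) for deg in range(1, 8)}
best_deg = min(fits, key=lambda d: mse(fits[d], va))

# Model assessment: report the chosen model's error on the held-out test set
print(best_deg, mse(fits[best_deg], te))
```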