Linear Regression
Data Mining problems
• Data mining problems are often divided into Predictive tasks and
Descriptive tasks.
• Predictive Analytics (Supervised learning):
Given observed data (x1,y1), (x2,y2),..., (xn, yn) learn a model to predict Y
from X.
 If Yi is a continuous numeric value, this task is called prediction (e.g., Yi = stock price, income, survival time)
 If Yi is a discrete or symbolic value, this task is called classification (e.g., Yi ∈ {0, 1}, Yi ∈ {spam, email}, Yi ∈ {1, 2, 3, 4})
• Descriptive Analytics (Unsupervised learning):
Given data x1,x2,..xn, identify some underlying patterns or structure in the
data.
Regression in data mining
• Predict a real-valued output for a given input, given a training set
– Examples:
• Predict rainfall in cm for a month
• Predict the stock price for the next day
• Predict the number of users who will click on an internet
advertisement
• Classification problem:
– A set of predefined categories/classes
– Training examples with attribute as well as class information available (supervised learning)
– Classification task: predict the class label for a new example (predictive mining)
• Clustering task:
– No predefined classes
– Attempt to find homogeneous groups in the data (exploratory data mining)
– Training examples have attribute values only
– No class information is available
• Regression:
– It is predictive data mining
– Given the attribute values of an example, you have to predict the output
– The output is not a class
– The output is a real value
– Supervised learning
Linear Regression
• Linear regression aims to predict the response Y by estimating the best
linear predictor: the linear function that is closest to the true regression
function f.
• Task: predict a real-valued Y, given a real-valued vector x, using a regression model f.
• An error function, e.g., least squares, is often used.
• Why is this usually called linear regression?
– The model is linear in the parameters (see the sketch below)
• Goal: the function f applied to the training data should produce values as close as possible, in aggregate, to the actual outputs
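As an aside, "linear in the parameters" does not mean linear in the raw input. The following is a minimal sketch (made-up data, NumPy assumed) showing that a model with a squared feature, y = a0 + a1x + a2x², is still linear regression, because it is linear in a0, a1, a2:

```python
import numpy as np

# Hypothetical data with a curved relationship between x and y
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0])

# Design matrix with columns [1, x, x^2]: non-linear features of x,
# but the model a0 + a1*x + a2*x^2 is linear in the parameters,
# so ordinary least squares still applies.
X = np.column_stack([np.ones_like(x), x, x**2])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # approximately [a0, a1, a2]
```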
• For example:
– xi = temperature today
– yi = rainfall volume tomorrow
• Another example:
– xi = temperature today
– yi = traffic density
• The training set consists of pairs (x1, y1), (x2, y2), ..., (xn, yn), and the regression task is to predict the value yn+1 for a new input xn+1.
• When x is a single value, this is called univariate regression.
• Multivariate regression:
– Training set: each example has several inputs and one output (see the notation below).
– There is a single output y but multiple inputs x1, x2, ...
Example: predict the temperature of a place based on humidity and pressure.
– There can also be multiple outputs.
• Regression model:
Y = f(x1, x2, ..., xn) → multivariate
Y = f(x) → univariate
Y: output (dependent) variable
x1, x2, ..., xn: input (independent) variables
f: regression function or model
Training set with two inputs: (x11, x12, y1), (x21, x22, y2), ..., (xn1, xn2, yn)
• The model f determines how the dependent variable y
depends on the independent variable x.
• Linear regression: f is a linear function.
In general, for linear regression:
y = f(x1, x2, ..., xn) = a0 + a1x1 + a2x2 + ... + anxn
where a0, a1, a2, ..., an are the regression coefficients.
• Univariate case: a line, y = a0 + a1x1
• Multivariate case: a plane, y = a0 + a1x1 + a2x2 + ... + anxn
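A tiny illustration (the coefficient values here are made up) of how the univariate and multivariate models above are evaluated; the multivariate case is just a dot product between the coefficients and the inputs:

```python
import numpy as np

a0, a1 = 2.0, 0.5                     # hypothetical univariate coefficients
x = 10.0
y_uni = a0 + a1 * x                   # y = a0 + a1*x

a = np.array([2.0, 0.5, -1.2, 0.3])   # hypothetical [a0, a1, a2, a3]
x_vec = np.array([10.0, 3.0, 7.0])    # inputs x1, x2, x3
y_multi = a[0] + a[1:] @ x_vec        # y = a0 + a1*x1 + a2*x2 + a3*x3
print(y_uni, y_multi)
```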
• Given: (x1, y1), (x2, y2), ..., (xn, yn)
Find a0, a1 such that y = a0 + a1x, so that the line best fits the given data.
a1, a2, ... are the slopes of the regression and a0 is the bias or axis intercept.
• Training a regression model:
Given a training set (x11, x12, ..., x1k, y1), (x21, x22, ..., x2k, y2), ..., (xn1, xn2, ..., xnk, yn):
– Find the values of the regression coefficients in
y = a0 + a1x1 + a2x2 + ... + anxn
that best match/fit the training data.
– Univariate regression: find values of a0, a1 such that the line y = a0 + a1x best fits the data (a minimal fitting sketch follows).
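A minimal univariate training sketch on made-up (x, y) pairs; np.polyfit returns the least-squares coefficients with the highest degree first, so the slope a1 comes before the intercept a0:

```python
import numpy as np

# Hypothetical training pairs (x1, y1), ..., (xn, yn)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of a straight line: returns [a1, a0]
a1, a0 = np.polyfit(x, y, deg=1)
print(a0, a1)            # intercept and slope of the best-fit line
print(a0 + a1 * 6.0)     # predicted y for a new input x = 6.0
```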
Least squares error
• To find the line having the least error, define an error function of a line.
• Define the error function
SSE = Σi=1..n ei²
where ei is the difference between the actual value of yi and the model-predicted value of yi.
• For a given value xi, the actual value is yi and the predicted value is a0 + a1xi.
• So, for the univariate case, the error is
ei = yi - (a0 + a1xi)
S = Σi=1..n ei² = Σi=1..n (yi - a0 - a1xi)²
• The square is taken in the error function so that positive and negative errors are given equal importance: both are equally bad.
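A short sketch of the error function, reusing the made-up data from the previous sketch: compute the residuals ei = yi - (a0 + a1xi) for a candidate line and sum their squares; the least-squares line gives the smallest value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(a0, a1, x, y):
    """Sum of squared errors of the line y = a0 + a1*x on the data."""
    e = y - (a0 + a1 * x)      # residuals e_i
    return np.sum(e ** 2)

print(sse(0.0, 2.0, x, y))     # an arbitrary candidate line
a1, a0 = np.polyfit(x, y, deg=1)
print(sse(a0, a1, x, y))       # the least-squares line: smallest SSE
```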
• For the multivariate case,
S = Σi=1..n ei² = Σi=1..n (yi - (a0 + a1xi1 + a2xi2 + ... + akxik))²
• Find values of the regression coefficients a0, a1, ... such that the sum of squared errors is minimised.
• Predictions based on this equation are the best predictions possible, in the sense that they will be unbiased (equal to the true values on average) and will have the smallest expected squared error compared to any unbiased estimates, under the following assumptions:
– Linearity of the relationship between dependent and independent variables
– Statistical independence of the errors
– Homoskedasticity (constant variance) of the errors
– Normality of the error distribution
Linear Regression
Model structure: y = a0 + a1x1 + a2x2 + ... + akxk = f(X : θ)
Model parameters: θ = [a0, a1, a2, ..., ak]
Error function: S(θ) = Σi ei², where ei = yi - f(Xi ; θ)
With this compact notation, the linear regression model can be written in the form
y = Xa + e
where y = [y1, y2, ..., yn]' is the vector of outputs, X is the n × (p+1) matrix whose i-th row is [1, xi1, ..., xip], a = [a0, a1, ..., ap]' is the vector of coefficients, and e = [e1, e2, ..., en]' is the vector of errors.
Linear regression estimates the parameters by finding the parameter values that minimize the residual sum of squares
S(a) = Σi=1..n ei² = e'e = (y - Xa)'(y - Xa)
By solving, we get the final estimates
â = (X'X)⁻¹X'y
The solutions of this equation are called the direct regression estimators, or usually the ordinary least squares (OLS) estimators.
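A minimal NumPy sketch (made-up data) of the OLS estimate â = (X'X)⁻¹X'y; forming the explicit inverse mirrors the formula, but in practice np.linalg.lstsq or a pseudo-inverse is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n samples, 2 predictors, known true coefficients
n = 100
X_raw = rng.normal(size=(n, 2))
true_a = np.array([1.0, 2.0, -3.0])           # [a0, a1, a2]
y = true_a[0] + X_raw @ true_a[1:] + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept a0
X = np.column_stack([np.ones(n), X_raw])

# Direct (normal-equation) OLS estimate: a_hat = (X'X)^(-1) X'y
a_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(a_hat)                                   # close to [1.0, 2.0, -3.0]

# Numerically safer equivalent
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a_lstsq)
```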
• S(θ) is defined for the training data:
S(θ) = Σ (actual value - model-predicted value)²
• We are really interested in finding the model that best predicts y on future data, i.e., minimising the sum of squared errors where the expectation is over future data.
• This is known as empirical learning, which is based on observed data. We are interested not only in minimising S(θ) on the training data but also in getting the best prediction on unknown future data.
• The usual assumption is that future data will behave the way past data behaved.
– If we have a model which minimises error on past data, it will also minimise the error on future data.
– If the training data is large and the model is simple, we are assuming that the best f on the training data is also the best predictor f on future test data.
Limitations of linear regression
• The true relationship of X and Y might be non-linear
– suggests generalisations to non-linear models
• Complexity:
– the cost of computation and the time complexity increase with the number of attributes
• Correlation/collinearity among the X variables
– can cause numerical instability (the inverse does not exist if the matrix is not full rank; see the sketch below)
– problems in interpretability (identifiability: determining whether the model's true parameters can be recovered from the observed data)
• The model includes all the variables.
– But what if there are 1000 attributes and only 3 variables are actually related to Y?
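A small sketch (made-up data) of the collinearity problem mentioned above: when one column of X is a linear combination of the others, X is not full rank, X'X is singular or ill-conditioned, and the normal-equation solve breaks down or becomes unstable:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = 2.0 * x1                        # exactly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

print(np.linalg.matrix_rank(X))      # 2, not 3: X is not full rank
print(np.linalg.cond(X.T @ X))       # enormous condition number (numerically singular)
# np.linalg.inv(X.T @ X) would raise LinAlgError or return garbage here.
```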
Complexity vs. goodness of fit
• Suppose the regression model is linear and too simple
– A simple model that does not fit the data well has a large training-set error
– A biased solution
• Suppose the model is made more complex (a non-linear regression model) to fit the training data itself
– A complex model has a low training-set error but a high error on future points: it causes overfitting
– With small changes to the data, the solution changes a lot
– A high-variance solution
• Occam's Razor principle (Principle of Parsimony):
– The principle states that "Entities should not be multiplied unnecessarily."
– "When you have two competing theories that make exactly the same predictions, the simpler one is the better."
– Use the simplest model which gives acceptable accuracy on the training set; do not complicate the model to overfit the training data
• Choose a model which sacrifices some training-set error for better performance on future samples.
• Penalize complex models (a small scoring sketch follows) based on
– Prior information (bias)
– Information criteria (MDL, AIC, BIC)
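One common way to penalize complexity, sketched here under the usual Gaussian-error assumption (and up to an additive constant), is to score each candidate model with AIC or BIC: both combine the fit term n·ln(RSS/n) with a penalty that grows with the number of parameters k:

```python
import numpy as np

def aic_bic(y, y_pred, k):
    """AIC and BIC for a least-squares fit with k parameters (Gaussian errors)."""
    n = len(y)
    rss = np.sum((y - y_pred) ** 2)            # residual sum of squares
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

# Hypothetical comparison: a straight line vs. a degree-5 polynomial
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=30)
for deg in (1, 5):
    coeffs = np.polyfit(x, y, deg)
    print(deg, aic_bic(y, np.polyval(coeffs, x), k=deg + 1))  # lower score is better
```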
Bias and variance for regression
• For regression, we can easily decompose the error of the learned model into two
parts: bias (error 1) and variance (error 2)
• Bias:
– The difference between the average prediction of our model and the correct
value which we are trying to predict.
– How much does the mean of the predictor differ from the optimal predictor
• Variance:
– The variability of the model prediction for a given data point; it tells us the spread of the predictions across training sets.
– How much does the predictor vary about its mean for different training datasets? (A small simulation sketch follows.)
– The variance of a learning algorithm is a measure of its precision. A high variance error implies that the model is highly sensitive to small fluctuations in the training data.
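A rough simulation sketch of the decomposition (an assumed true function and made-up noise): draw many training sets from the same process, fit a simple linear model on each, and inspect, at one fixed query point, how far the average prediction is from the truth (bias) and how much the predictions spread around their own mean (variance):

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    return np.sin(2 * np.pi * x)              # assumed true regression function

x_test = 0.3                                  # fixed query point
preds = []
for _ in range(200):                          # 200 independent training sets
    x_tr = rng.uniform(0.0, 1.0, size=20)
    y_tr = true_f(x_tr) + rng.normal(scale=0.2, size=20)
    a1, a0 = np.polyfit(x_tr, y_tr, deg=1)    # simple (high-bias) linear model
    preds.append(a0 + a1 * x_test)

preds = np.array(preds)
bias = preds.mean() - true_f(x_test)          # average prediction vs. the truth
variance = preds.var()                        # spread of predictions across training sets
print(bias, variance)
```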
Linear Regression for Data Mining Application
Training and Test Error
• Given a dataset, the training data is used to fit the parameters of the model. For training we choose a loss function, e.g., squared error for regression.
• The training error is the mean error over the training sample.
• The test (or generalization) error is the expected prediction error over an independent test sample.
• The prediction error or true (generalization) error (over the whole population) is the target performance measure, i.e., performance on a random test point (X, Y).
• Training error is not a good estimator of test error (see the sketch below).
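A minimal sketch of the gap between the two quantities (assumed data-generating process): fit on a training sample, then measure mean squared error on the training sample (training error) and on a fresh, independent sample (an estimate of test error). A very flexible model typically shows a much larger gap:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample(n):
    """Draw n points from an assumed data-generating process."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = sample(20)
x_test, y_test = sample(1000)        # independent test sample

for deg in (1, 9):                   # a simple vs. a very flexible polynomial
    coeffs = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(deg, train_mse, test_mse)  # degree 9: low training error, much higher test error
```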
Model Complexity and Generalization
• A model's ability to adapt to patterns in the data is what we call model complexity.
• A model with greater complexity might be theoretically more accurate (i.e., low bias).
– But you have less control over what it might predict on a tiny training data set.
– Different training data sets will result in widely varying predictions for the same test instance.
• Generalization ability: we want good predictions on new data, i.e., 'generalization'. What is the out-of-sample error of the learner f?
• Training error can be reduced by making the hypothesis more sensitive to the training data, but this may lead to overfitting and poor generalization.
Model Selection and Assessment
• When we want to estimate test error, we may have two different goals in
mind:
1. Model selection: Estimate the performance of different hypotheses or
algorithms in order to choose the (approximately) best one.
2. Model assessment: Having chosen a final hypothesis or algorithm,
estimate its generalization error on new data.
• Trade-off between bias and variance:
– Simple Models: High Bias, Low Variance
– Complex Models: Low Bias, High Variance
• Thus, a designer is virtually always confronted with the following dilemma:
– On one hand, if the model is too simple, it will give a poor
approximation of the phenomenon (underfitting).
– On the other hand, if the model is too complex, it will be able to
fit exactly the examples available, without finding a consistent
way of modelling (overfitting).
• Choice of models balances bias and variance.
– Over-fitting → Variance is too High
– Under-fitting → Bias is too High
Training, Validation and Test Data
• In a data-rich situation, the best approach to both model selection and
model assessment is to randomly divide the dataset into three parts:
1. A training set used to fit the models.
2. A validation set (or development test set) used to estimate test error for
model selection.
3. A test set (or evaluation test set) used for assessment of the
generalization error of the finally chosen model.
• Training: train different models
• Validation: evaluate different models
• Test: evaluate the accuracy of the final model
The trained model can then be used to make predictions on
unseen observations
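A sketch of the three-way protocol on made-up data: fit the candidate models on the training set, pick the one with the lowest validation error (model selection), and only then report its error on the untouched test set (model assessment):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=300)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=300)

# Random three-way split: 60% training, 20% validation, 20% test
idx = rng.permutation(300)
tr, va, te = idx[:180], idx[180:240], idx[240:]

def mse(coeffs, i):
    return np.mean((y[i] - np.polyval(coeffs, x[i])) ** 2)

# Model selection: choose the polynomial degree with the lowest validation error
fits = {deg: np.polyfit(x[tr], y[tr], deg) for deg in range(1, 8)}
best_deg = min(fits, key=lambda d: mse(fits[d], va))

# Model assessment: report the chosen model's error on the held-out test set
print(best_deg, mse(fits[best_deg], te))
```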