2. Learning Objectives
- Understand linear regression
- Understand the purpose of the lm() function
- Use lm() to fit regression models
- Interpret the output of lm()
3. Ever Wondered How…
- Maps can estimate your travelling time
- Surcharge prices are set to meet taxi demand
- HDB resale prices are forecasted
4. What is Linear Regression?
- Interested in the relationship between a dependent variable (y)
and one or more independent variables (x)
- Models relationships between variables
- Simple Linear Regression (1 independent variable)
- Multiple Linear Regression (2 or more independent variables)
- Finds best-fit line that minimises distance between observed
data values and predicted values
6. Ordinary Least Squares
- Find the best-fit line ⇒ want the line to be as close to the data points as possible
- Minimise the vertical distance between each point and the line
- Residual Sum of Squares (RSS) ⇒ sum of squared residuals over all data points
- Squared so that positive and negative residuals do not "cancel off" one another
Minimise RSS ⇒ minimum total distance between line and points ⇒ best-fit line
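The RSS described above can be written out explicitly. Writing the fitted intercept and slope as b0 and b1 (the m and c of y = mx + c):

```latex
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
             = \sum_{i=1}^{n} \big(y_i - (b_0 + b_1 x_i)\big)^2
```

OLS picks the b0 and b1 that make this sum as small as possible.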
7. Visualised on a Simple Plot
[Scatter plot showing the actual data points, the residuals, and the regression line]
8. How Do We Use R to Fit a Linear Regression Model?
#10:Simple linear regression (SLR) (black cherry trees)
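A minimal sketch of the SLR fit referred to here, using R's built-in `trees` dataset (girth, height, and volume of 31 felled black cherry trees); the exact variables used on the slide are an assumption:

```r
# Fit a simple linear regression: timber volume explained by girth,
# using R's built-in `trees` dataset (31 black cherry trees)
model <- lm(Volume ~ Girth, data = trees)

# Print the diagnostic output discussed in the notes that follow:
# residual quartiles, coefficients, R-squared, F-statistic
summary(model)
```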
#13:Min ⇒ the data point furthest below the regression line
1Q ⇒ 25% of the residuals are less than this value
3Q ⇒ 25% of the residuals are greater than this value
Max ⇒ the data point furthest above the regression line
Mean of residuals is not shown as it is always 0
Now, what can we infer just from the residuals alone to determine if this can be a good linear regression model?
Median of residuals should ideally be as close to 0 as possible (hard to define "close" in absolute terms, as it is relative to your data)
Why ⇒ this implies the model isn't skewed one way or the other
The residuals should also be symmetrically distributed ⇒ Min and Max should have similar magnitudes, as should 1Q and 3Q
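These residual checks can be done directly in R. A sketch, again assuming the `trees` SLR fit:

```r
# Inspect the residual distribution of an SLR fit to judge skew and symmetry
model <- lm(Volume ~ Girth, data = trees)
res <- residuals(model)

summary(res)   # Min / 1Q / Median / 3Q / Max, same figures as summary(model)
mean(res)      # numerically zero for any OLS fit with an intercept

# Rough symmetry check: tail magnitudes should be comparable
c(lower_tail = abs(min(res)), upper_tail = max(res))
```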
#15:Moving on to coefficients: the intercept is always reported, along with every x variable we provided ⇒ y = mx + c
#17:Standard error ⇒ determines the width of the confidence interval
Ideally, you want the standard error to be small relative to the estimate
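A sketch of how to pull out the estimates and standard errors, and the confidence intervals they imply (again assuming the `trees` fit):

```r
# Extract the coefficient table from the model summary
model <- lm(Volume ~ Girth, data = trees)
coefs <- summary(model)$coefficients

coefs[, "Estimate"]    # fitted intercept and slope
coefs[, "Std. Error"]  # want these small relative to the estimates

# 95% confidence intervals; their width is driven by the standard errors
confint(model)
```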
#21:Using 0.05 as the benchmark, a p-value below it means the relationship between the y variable and the x variable is unlikely to be due to chance ⇒ statistically significant
Stars on the right show the significance levels, as listed in the significance codes
#23:Residual standard error ⇒ roughly the average distance of the points from the regression line, adjusted for the number of parameters estimated
Degrees of freedom ⇒ number of observations (31) minus number of estimated parameters (2: intercept and slope) = 29
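Both quantities can be read straight off the fitted model; a sketch with the `trees` SLR:

```r
model <- lm(Volume ~ Girth, data = trees)

summary(model)$sigma  # residual standard error
df.residual(model)    # 31 observations - 2 estimated parameters = 29
```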
#25:Multiple R-squared ⇒ never decreases as more x variables are added
Adjusted R-squared ⇒ penalises adding useless predictor (x) variables
If the multiple R-squared is much higher than the adjusted R-squared, the model might be overfitted
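A sketch of the comparison, using `trees`: adding Height as a second predictor turns the SLR into an MLR, and the two R-squared values can be compared side by side:

```r
# Simple vs multiple linear regression on the same data
slr <- lm(Volume ~ Girth, data = trees)
mlr <- lm(Volume ~ Girth + Height, data = trees)

# Multiple R-squared can only go up when a predictor is added;
# adjusted R-squared rises only if the predictor pulls its weight
c(slr = summary(slr)$r.squared, mlr = summary(mlr)$r.squared)
c(slr = summary(slr)$adj.r.squared, mlr = summary(mlr)$adj.r.squared)
```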
#27:F-statistic ⇒ want a value much greater than 1 to be statistically significant; alternatively, refer to its p-value, where < 0.05 indicates significance
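The F-statistic and its overall p-value can be recovered from the summary object; a sketch with the `trees` fit:

```r
model <- lm(Volume ~ Girth, data = trees)
fstat <- summary(model)$fstatistic  # named vector: value, numdf, dendf

fstat["value"]  # want this much greater than 1

# Overall p-value of the F test (summary() prints this but does not store it)
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```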